ABSTRACT



This invention relates to customized electronic identification 
of desirable objects, such as news articles, in an electronic 
media environment, and in particular to a system that 
automatically constructs both a "target profile" for each 
target object in the electronic media based, for example, on 
the frequency with which each word appears in an article 
relative to its overall frequency of use in all articles, as well 
as a "target profile interest summary" for each user, which 
target profile interest summary describes the user's interest 
level in various types of target objects. The system then 
evaluates the target profiles against the users' target profile
interest summaries to generate a user-customized rank 
ordered listing of target objects most likely to be of interest 
to each user so that the user can select from among these 
potentially relevant target objects, which were automatically 
selected by this system from the plethora of target objects 
that are profiled on the electronic media. Users' target profile 
interest summaries can be used to efficiently organize the 
distribution of information in a large scale system consisting 
of many users interconnected by means of a communication 
network. Additionally, a cryptographically-based pseud- 
onym proxy server is provided to ensure the privacy of a 
user's target profile interest summary, by giving the user 
control over the ability of third parties to access this sum- 
mary and to identify or contact the user. 

15 Claims, 13 Drawing Sheets 
IDENTIFICATION OF DESIRABLE OBJECTS

CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application was originally filed as Provisional
Patent Application Ser. No. 60/032,461 on Dec. 9, 1996 and
[...] 29, [...] U.S. Pat. No. 5,758,[...] SYSTEM AND METHOD
FOR SCHED[...] assignee as the present application.

FIELD OF INVENTION

This invention relates to customized electronic identification
of desirable objects, such as news articles, in an electronic
media environment, and in particular to a system that
automatically constructs both a "target profile" for each
target object in the electronic media based, for example, on
the frequency with which each word appears in an article
relative to its overall frequency of use in all articles, as well
as a "target profile interest summary" for each user, which
target profile interest summary describes the user's interest
level in various types of target objects. The system then
evaluates the target profiles against the users' target profile
interest summaries to generate a user-customized rank
ordered listing of target objects most likely to be of interest
to each user so that the user can select from among these
potentially relevant target objects, which were automatically
selected by this system from the plethora [...] server is
provided to ensure the privacy of a user's target [...] or
contact the user.

PROBLEM

[...] a user to access information of relevance and interest to
the user without requiring the user to expend an excessive
amount of time and energy searching for the information.
Electronic media, such as on-line information sources, provide
a vast amount of information to users, typically in the
form of "articles," each of which comprises a publication
item or document that relates to a specific topic. The
difficulty with electronic media is that the amount of
information available to the user is overwhelming and [...]
depository systems that are connected on-line are not
organized in a manner that sufficiently simplifies access to only
the articles of interest to the user. Presently, a user either fails
to access relevant articles because they [...] identified or
expends a significant amount of time and energy to
conduct an exhaustive search of all articles to identify those
most likely to be of interest to the user. [...]
if the user conducts an exhaustive search, present information
searching techniques do not necessarily accurately
extract only the most relevant articles, [...]
articles of marginal relevance due to the functional limitations
of the information searching techniques. There [...]
no existing system which automatically estimates the inher[...]

[...] the plethora of information. [...] commercialization
of communication networks, such as the Internet,
the [...] of available information has increased. Custom[...]
information delivery process to the user's [...] tastes and
interests is the ultimate [...] problem. However, the
techniques which have been pro[...] regard, no one to date has
successfully addressed these [...] and [...] a system that [...]
commercial context, such as on-line services avail[...]
Internet. [...] a need for an information [...] largely or
entirely passive, [...] undemanding of the user, and yet both
precise and comp[...] ability to learn and truly represent [...]
interests. Present information retrieval [...] the desired
information [...] through cumbersome interfaces. [...] receive
information on a computer network [...] information or by
passively receiving [...] sent to them. Just as users of [...]
much information [...] are targeted with electronic [...]
advertising, both by [...] should not be [...] effort to finding
efficient [...] of information retrieval are based on keyword
[...] unreliable, as users may not think of the right [...]
keywords may be used in unwanted articles [...] computers
retrieve many articles [...] logical combination of [...] help
[...] keyword searching but do not [...] the problem of
inaccurate search results. Starting in [...] approach to
information [...] the information they wanted, or to quantify
how close the information contained [...] was described by a
[...] or, in more advanced systems, a table of word frequen[...]
the article. Since [...] measure of similarity [...] between
profiles [...] the measured similarity of article profiles can be
used in article [...] write a short description of the desired
information. The information retrieval computer generates an
article profile for the request and then retrieves articles with 
profiles similar to the profile generated for the request. These 
requests can then be refined using "relevance feedback", 
where the user actively or passively rates the articles 
retrieved as to how close the information contained therein 
is to what is desired. The information retrieval computer 
then uses this relevance feedback information to refine the 
request profile and the process is repeated until the user 
either finds enough articles or tires of the search. 
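
The request-refinement loop described above can be sketched as follows. This is only a minimal illustration: it assumes word-frequency profiles held in dictionaries, cosine similarity as the measure of similarity between profiles, and a Rocchio-style weighted update for the relevance feedback step. None of these specifics, nor the function names and weights, come from the systems discussed in the text.

```python
import math
from collections import Counter

def profile(text):
    """Build a word-frequency profile (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(p, q):
    """Cosine similarity between two sparse word-frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def retrieve(request, articles, k=2):
    """Return the ids of the k articles most similar to the request profile."""
    ranked = sorted(articles, key=lambda a: cosine(request, articles[a]), reverse=True)
    return ranked[:k]

def refine(request, liked, disliked, alpha=1.0, beta=0.5, gamma=0.25):
    """Rocchio-style update (assumed): move toward liked profiles, away from disliked."""
    updated = {w: alpha * v for w, v in request.items()}
    for docs, sign, weight in ((liked, 1, beta), (disliked, -1, gamma)):
        for doc in docs:
            for w, v in doc.items():
                updated[w] = updated.get(w, 0.0) + sign * weight * v / len(docs)
    # Keep only words that still carry positive weight.
    return {w: v for w, v in updated.items() if v > 0}

articles = {
    "a1": profile("jazz saxophone concert review"),
    "a2": profile("stock market report"),
    "a3": profile("jazz festival lineup announced"),
}
request = profile("jazz music")
first_pass = retrieve(request, articles)
# The user rates a1 as relevant and a2 as irrelevant; the request is refined and re-run.
request = refine(request, liked=[articles["a1"]], disliked=[articles["a2"]])
second_pass = retrieve(request, articles)
```

Repeating the `retrieve` and `refine` steps corresponds to the loop that continues until the user finds enough articles or tires of the search.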

A number of researchers have looked at methods for 
selecting articles of most interest to users. An article titled 
"Social Information filtering: algorithms for automating 
'word of mouth'" was published at the CHI-95 Proceedings
by Pattie Maes et al. and describes the Ringo information
retrieval system which recommends musical selections. The 
Ringo system requires active feedback from the users — 
users must manually specify how much they like or dislike 
each musical selection. The Ringo system maintains a 
complete list of users' ratings of music selections and makes
recommendations by finding which selections were liked by 
multiple people. However, the Ringo system does not take 
advantage of any available descriptions of the music, such as 
structured descriptions in a data base, or free text, such as 
that contained in music reviews. An article titled "Evolving 
agents for personalized information filtering", published at 
the Proc. 9th IEEE Conf. on AI for Applications by Sheth 
and Maes, described the use of agents for information 
filtering which use genetic algorithms to learn to categorize 
Usenet news articles. In this system, users must define news 
categories and the users actively indicate their opinion of the 
selected articles. Their system uses a list of keywords to 
represent sets of articles and the records of users' interests 
are updated using genetic algorithms. 

A number of other research groups have looked at the 
automatic generation and labeling of clusters of articles for 
the purpose of browsing through the articles. A group at 
Xerox PARC published a paper titled "Scatter/gather: a
cluster-based approach to browsing large article collections"
at the 15 Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al. 
1992). This group developed a method they call "scatter/ 
gather" for performing information retrieval searches. In this 
method, a collection of articles is "scattered" into a small 
number of clusters; the user then chooses one or more of
these clusters based on short summaries of the cluster. The 
selected clusters are then "gathered" into a subcollection, 
and then the process is repeated. Each iteration of this 
process is expected to produce a small, more focused 
collection. The cluster "summaries" are generated by picking
those words which appear most frequently in the cluster
and the titles of those articles closest to the center of the 
cluster. However, no feedback from users is collected or 
stored, so no performance improvement occurs over time. 

Apple's Advanced Technology Group has developed an 
interface based on the concept of a "pile of articles". This 
interface is described in an article titled "A 'pile' metaphor
for supporting casual organization of information in Human
factors in computer systems" published in CHI '92 Conf.
Proc. 627-634 by R. Mander, G. Salomon and Y. Wong,
1992. Another article titled "Content awareness in a file
system interface: implementing the 'pile' metaphor for orga- 
nizing information" was published in 16 Ann. Int'l SIGIR 
'93, ACM 260-269 by Rose E. D. et al. The Apple interface 
uses word frequencies to automatically file articles by pick- 
ing the pile most similar to the article being filed. This 
system functions to cluster articles into subpiles, determine 
key words for indexing by picking the words with the largest 
TF/IDF (where TF is term (word) frequency and IDF is the
inverse document frequency) and label piles by using the 
determined key words. 
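
As a hedged illustration of selecting key words by TF/IDF score, the sketch below uses one common formulation: term frequency within an article multiplied by the logarithm of the number of articles divided by the number of articles containing the word. The exact weighting used by the Apple interface is not stated in the text, so the formula, the whitespace tokenization and the function name are assumptions.

```python
import math
from collections import Counter

def tf_idf_keywords(articles, top_n=3):
    """Return the top_n highest-scoring words for each article.

    articles: dict mapping an article id to its text.
    Score = (word count / article length) * log(N / document frequency).
    """
    tokens = {aid: text.lower().split() for aid, text in articles.items()}
    n_docs = len(tokens)
    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    keywords = {}
    for aid, words in tokens.items():
        tf = Counter(words)
        scores = {w: (count / len(words)) * math.log(n_docs / doc_freq[w])
                  for w, count in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        keywords[aid] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return keywords

sample = {
    "a": "the cat sat on the mat",
    "b": "the dog sat",
    "c": "the bird flew",
}
labels = tf_idf_keywords(sample)
```

A word such as "the", which occurs in every article, receives a score of zero and so is never preferred over a word that distinguishes one article from the rest.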

Numerous patents address information retrieval methods,
but none develop records of a user's interest based on
passive monitoring of which articles the user accesses. None
of the systems described in these patents present computer
architectures to allow fast retrieval of articles distributed
across many computers. None of the systems described in
these patents address issues of using such article retrieval
and matching methods for purposes of commerce or of
matching users with common interests or developing records
of users' interests. U.S. Pat. No. 5,321,833 issued to Chang
et al. teaches a method in which users choose terms to use
in an information retrieval query, and specify the relative
weightings of the different terms. The Chang system then
calculates multiple levels of weighting criteria. U.S. Pat. No.
5,301,109 issued to Landauer et al. teaches a method for
retrieving articles in a multiplicity of languages by
constructing "latent vectors" (SVD or PCA vectors) which
represent correlations between the different words. U.S. Pat.
No. 5,331,554 issued to Graham et al. discloses a method for
retrieving segments of a manual by comparing a query with
nodes in a decision tree. U.S. Pat. No. 5,331,556 addresses
techniques for deriving morphological part-of-speech
information and thus to make use of the similarities of different
forms of the same word (e.g. "article" and "articles").

Therefore, there presently is no information retrieval and
delivery system operable in an electronic media environment
that enables a user to access information of relevance
and interest to the user without requiring the user to expend
an excessive amount of time and energy.

SOLUTION 

The above-described problems are solved and a technical
advance achieved in the field by the system for customized
electronic identification of desirable objects in an electronic
media environment, which system enables a user to access
target objects of relevance and interest to the user without
requiring the user to expend an excessive amount of time
and energy. Profiles of the target objects are stored on
electronic media and are accessible via a data communication
network. In many applications, the target objects are
informational in nature, and so may themselves be stored on
electronic media and be accessible via a data communication
network.

Relevant definitions of terms for the purpose of this
description include:

(a.) an object available for access by the user, which may be
either physical or electronic in nature, is termed a "target
object,"
(b.) a digitally represented profile indicating that target
object's attributes is termed a "target profile,"
(c.) the user looking for the target object is termed a "user,"
(d.) a profile holding that user's attributes, including
age/zip code/etc., is termed a "user profile,"
(e.) a summary of digital profiles of target objects that a user
likes and/or dislikes is termed the "target profile interest
summary" of that user,
(f.) a profile consisting of a collection of attributes, such that
a user likes target objects whose profiles are similar to this
collection of attributes, is termed a "search profile" or in
some contexts a "query" or "query profile,"
(g.) a specific embodiment of the target profile interest
summary which comprises a set of search profiles is termed the
"search profile set" of a user,
(h.) a collection of target objects with similar profiles is
termed a "cluster,"
(i.) an aggregate profile formed by averaging the attributes of
all target objects in a cluster is termed a "cluster profile,"
(j.) a real number determined by calculating the statistical
variance of the profiles of all target objects in a cluster is
termed a "cluster variance,"
(k.) a real number determined by calculating the maximum
distance between the profiles of any two target objects in a
cluster is termed a "cluster diameter."
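
The three cluster quantities defined above can be illustrated with a short sketch. It assumes, purely for illustration, that each target profile is a fixed-length list of numeric attributes, that distance is Euclidean, and that the cluster variance is the mean squared distance of the profiles from the cluster profile. The text does not commit to these choices here, and the function names are hypothetical.

```python
import math
from itertools import combinations

def cluster_profile(profiles):
    """Average each attribute over all target profiles in the cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def distance(p, q):
    """Euclidean distance between two numeric profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_variance(profiles):
    """Mean squared distance of the profiles from the cluster profile."""
    center = cluster_profile(profiles)
    return sum(distance(p, center) ** 2 for p in profiles) / len(profiles)

def cluster_diameter(profiles):
    """Maximum distance between the profiles of any two target objects."""
    return max((distance(p, q) for p, q in combinations(profiles, 2)), default=0.0)

cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
center = cluster_profile(cluster)      # attribute-wise average
spread = cluster_variance(cluster)     # how tightly the cluster is packed
widest = cluster_diameter(cluster)     # distance between the two farthest members
```

A single-member cluster has a variance and a diameter of zero under these definitions.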

The system for electronic identification of desirable 
objects of the present invention automatically constructs 
both a target profile for each target object in the electronic 
media based, for example, on the frequency with which each 
word appears in an article relative to its overall frequency of 
use in all articles, as well as a "target profile interest 
summary" for each user, which target profile interest sum- 
mary describes the user's interest level in various types of 
target objects. The system then evaluates the target profiles 
against the users' target profile interest summaries to gen- 
erate a user-customized rank ordered listing of target objects 
most likely to be of interest to each user so that the user can 
select from among these potentially relevant target objects, 
which were automatically selected by this system from the 
plethora of target objects available on the electronic media. 

Because people have multiple interests, a target profile 
interest summary for a single user must represent multiple 
areas of interest, for example, by consisting of a set of 
individual search profiles, each of which identifies one of the 
user's areas of interest. Each user is presented with those
target objects whose profiles most closely match the user's
interests as described by the user's target profile interest
summary. Users' target profile interest summaries are
automatically updated on a continuing basis to reflect each user's
changing interests. In addition, target objects can be grouped 
into clusters based on their similarity to each other, for 
example, based on similarity of their topics in the case where 
the target objects are published articles, and menus auto- 
matically generated for each cluster of target objects to allow 
users to navigate throughout the clusters and manually 
locate target objects of interest. For reasons of confidenti- 
ality and privacy, a particular user may not wish to make 
public all of the interests recorded in the user's target profile 
interest summary, particularly when these interests are deter- 
mined by the user's purchasing patterns. The user may 
desire that all or part of the target profile interest summary 
be kept confidential, such as information relating to the 
user's political, religious, financial or purchasing behavior; 
indeed, confidentiality with respect to purchasing behavior 
is the user's legal right in many states. It is therefore 
necessary that data in a user's target profile interest summary 
be protected from unwanted disclosure except with the 
user's agreement. At the same time, the user's target profile 
interest summaries must be accessible to the relevant servers 
that perform the matching of target objects to the users, if the 
benefit of this matching is desired by both providers and 
consumers of the target objects. The disclosed system pro- 
vides a solution to the privacy problem by using a proxy 
server which acts as an intermediary between the informa- 
tion provider and the user. The proxy server dissociates the 
user's true identity from the pseudonym by the use of 
cryptographic techniques. The proxy server also permits 
users to control access to their target profile interest sum- 
maries and/or user profiles, including provision of this 
information to marketers and advertisers if they so desire, 
possibly in exchange for cash or other considerations. Mar- 
keters may purchase these profiles in order to target adver- 
tisements to particular users, or they may purchase partial 
user profiles, which do not include enough information to 
identify the individual users in question, in order to carry out 
standard kinds of demographic analysis and market research 
on the resulting database of partial user profiles.
Pseudonymous control of an information server suggests how a
special discount can be issued to a user's pseudonym and
that such a digital credential is provided to the user as a
result of his/her user profile making him/her eligible. The
user may thus present this type of credential to the
appropriate vendor to take advantage of the discount. This
technique can be extended also to smart cards wherein the digital
credential providing the discount is downloaded from the
client to the smart card and upon presentation, the vendor
may, if desired, delete the credential upon redemption by the
user. These discount credentials may similarly include any
of the discount types (customized promotions) herein
disclosed wherein each purchase may be identified (characterized)
and credentialized by the vendor onto the user's smart card
and/or the vendor's system.

In the preferred embodiment of the invention, the system
for customized electronic identification of desirable objects
uses a fundamental methodology for accurately and efficiently
matching users and target objects by automatically
calculating, using and updating profile information that
describes both the users' interests and the target objects'
characteristics. The target objects may be published articles,
purchasable items, or even other people, and their properties
are stored, and/or represented and/or denoted on the
electronic media as (digital) data. Examples of target objects can
include, but are not limited to: a newspaper story of potential
interest, a movie to watch, an item to buy, e-mail to receive,
or another person to correspond with. In one suggested
application, the user is a sender of e-mail (which may have
originated from the user or from another external source,
such as from outside of a large organization) and the target
objects are users who might be considered most appropriate
based upon previous messages which they have received,
read and responded to. Accordingly, like other target objects,
users (or user pseudonyms) in accordance with their user
profiles (or the portions thereof which they have disclosed) may
be organized and browsed within an automatically generated
menu tree, which is described in detail below. In all these
cases, the information delivery process in the preferred
embodiment is based on determining the similarity between
a profile for the target object and the profiles of target objects
for which the user (or a similar user) has provided positive
feedback in the past. The individual data that describe a
target object and constitute the target object's profile are
herein termed "attributes" of the target object. Attributes
may include, but are not limited to, the following: (1) long
pieces of text (a newspaper story, a movie review, a product
description or an advertisement), (2) short pieces of text
(name of a movie's director, name of town from which an
advertisement was placed, name of the language in which an
article was written), (3) numeric measurements (price of a
product, rating given to a movie, reading level of a book), (4)
associations with other types of objects (list of actors in a
movie, list of persons who have read a document). Any of
these attributes, but especially the numeric ones, may
correlate with the quality of the target object, such as measures
of its popularity (how often it is accessed) or of user
satisfaction (number of complaints received).
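
One way to picture a target profile built from the four kinds of attributes listed above is the following sketch. The class name, field names and sample values are hypothetical and chosen only for illustration; the text does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TargetProfile:
    """Hypothetical container for the four attribute kinds described above."""
    long_text: Dict[str, str] = field(default_factory=dict)           # (1) stories, reviews
    short_text: Dict[str, str] = field(default_factory=dict)          # (2) director, town
    numeric: Dict[str, float] = field(default_factory=dict)           # (3) price, rating
    associations: Dict[str, List[str]] = field(default_factory=dict)  # (4) actors, readers

    def quality_signals(self, names=("popularity", "complaints")):
        """Pick out the numeric attributes assumed here to correlate with quality."""
        return {k: v for k, v in self.numeric.items() if k in names}

movie = TargetProfile(
    long_text={"review": "A slow-burning thriller ..."},
    short_text={"director": "Jane Doe", "language": "English"},
    numeric={"price": 3.99, "rating": 4.5, "popularity": 1200},
    associations={"actors": ["Actor A", "Actor B"]},
)
signals = movie.quality_signals()
```

Keeping the attribute kinds separate lets a similarity measure treat each kind in its own way, for example comparing long text by word frequencies and numeric measurements by their difference.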

The preferred embodiment of the system for customized
electronic identification of desirable objects operates in an
electronic media environment for accessing these target
objects, which may be news, electronic mail, other published
documents, or product descriptions. The system in its
broadest construction comprises three conceptual modules,
which may be separate entities distributed across many
implementing systems, or combined into a lesser subset of
physical entities. The specific embodiment of this system
disclosed herein illustrates the use of a first module which
automatically constructs a "target profile" for each target
object in the electronic media based on various descriptive
attributes of the target object. A second module uses interest
feedback from users to construct a "target profile interest
summary" for each user, for example in the form of a "search
profile set" consisting of a plurality of search profiles, each
of which corresponds to a single topic of high interest for the
user. The system further includes a profile processing module
which estimates each user's interest in various target
objects by reference to the users' target profile interest
summaries, for example by comparing the target profiles of
these target objects against the search profiles in users'
search profile sets, and generates for each user a customized
rank-ordered listing of target objects most likely to be of
interest to that user. Each user's target profile interest
summary is automatically updated on a continuing basis to
reflect the user's changing interests.
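
The role of the profile processing module can be sketched as follows. The sketch assumes that profiles are sparse dictionaries of attribute weights, that similarity is measured by cosine similarity, and that a user's interest in a target object is estimated as its similarity to the closest search profile in the user's search profile set. These are illustrative assumptions, and the function names are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse profiles (an assumed measure)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def estimate_interest(target_profile, search_profile_set):
    """Interest = similarity to the closest search profile (an assumed rule)."""
    return max((cosine(target_profile, s) for s in search_profile_set), default=0.0)

def rank_targets(target_profiles, search_profile_set, top_n=None):
    """Return (target id, score) pairs ordered from most to least likely interest."""
    scored = [(tid, estimate_interest(p, search_profile_set))
              for tid, p in target_profiles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# One search profile per topic of high interest for this user.
search_profile_set = [{"jazz": 1, "music": 1}, {"stocks": 1, "bonds": 1}]
targets = {
    "t1": {"jazz": 2, "concert": 1},
    "t2": {"cooking": 1},
    "t3": {"bonds": 1, "stocks": 1},
}
listing = rank_targets(targets, search_profile_set)
```

Taking the maximum over the search profile set reflects the idea that a target object needs to match only one of the user's several areas of interest to rank highly.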

Target objects may be of various sorts, and it is sometimes
advantageous to use a single system that delivers and/or
clusters target objects of several distinct sorts at once, in a
unified framework. For example, users who exhibit a strong
interest in certain novels may also show an interest in certain
movies, presumably of a similar nature. A system in which
some target objects are novels and other target objects are
movies can discover such a correlation and exploit it in order
to group particular novels with particular movies, e.g., for
clustering purposes, or to recommend the movies to a user
who has demonstrated interest in the novels. Similarly, if
users who exhibit an interest in certain World Wide Web
sites also exhibit an interest in certain products, the system
can match the products with the sites and thereby recommend
to the marketers of those products that they place
advertisements at those sites, e.g., in the form of hypertext
links to their own sites. The presently described system
explains the techniques for target advertising (on a user by
user basis) through both links from advertisements on a web
page which tends to be visited by the most likely buyers of
that particular product or service, and routing advertisements
to such users via e-mail. (This assumes that because
user visitorship is measured at the level of the web page,
certain pages within the web site may be more appropriate
for certain advertisements due to the slight differences in
their visitorship.) Text chat (or acoustic voice chat) using a
text to speech conversion module may be used in conjunction
with real time profiling of the real time user dialogues occurring
within that chat session. Advertisements which are relevant
to the nature of the content being discussed at present may
provide temporary links to the appropriate product such that
when the nature of the content changes the advertisements
change (or may disappear) accordingly.

The ability to measure the similarity of profiles describing
target objects and a user's interests can be applied in two
basic ways: filtering and browsing. Filtering is useful when
large numbers of target objects are described in the
electronic media space. These target objects can for example be
articles that are received or potentially received by a user,
who only has time to read a small fraction of them. For
example, one might potentially receive all items on the AP
news wire service, all items posted to a number of news
groups, all advertisements in a set of newspapers, or all
unsolicited electronic mail, but few people have the time or
inclination to read so many articles. A filtering system in the
system for customized electronic identification of desirable
objects automatically selects a set of articles that the user is
likely to wish to read. The accuracy of this filtering system
improves over time by noting which articles the user reads
and by generating a measurement of the depth to which the
user reads each article. This information is then used to
update the user's target profile interest summary. Browsing
provides an alternate method of selecting a small subset of
a large number of target objects, such as articles. Articles are
organized so that users can actively navigate among groups
of articles by moving from one group to a larger, more
general group, to a smaller, more specific group, or to a
closely related group. Each individual article forms a
one-member group of its own, so that the user can navigate to
and from individual articles as well as larger groups. The
methods used by the system for customized electronic
identification of desirable objects allow articles to be
grouped into clusters and the clusters to be grouped and
merged into larger and larger clusters. These hierarchies of
clusters then form the basis for menuing and navigational
systems to allow the rapid searching of large numbers of
articles. This same clustering technique is applicable to any
type of target objects that can be profiled on the electronic
media such as product selections within a menu or
throughout the World Wide Web.
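
The grouping of articles into clusters, and of clusters into larger and larger clusters, can be sketched with a simple bottom-up merge. The sketch assumes numeric article profiles, Euclidean distance, and merging of the two clusters whose averaged profiles are closest; the clustering procedure actually used by the system may differ, and all names below are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two numeric profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(profiles):
    """Attribute-wise average of a list of profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def build_cluster_tree(profiles):
    """Merge the two closest clusters repeatedly until a single root remains.

    profiles: non-empty dict mapping an article id to its numeric profile.
    Each node is a pair (members, children); every individual article
    starts out as a one-member group of its own.
    """
    nodes = [((aid,), ()) for aid in sorted(profiles)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                ci = centroid([profiles[a] for a in nodes[i][0]])
                cj = centroid([profiles[a] for a in nodes[j][0]])
                d = distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (nodes[i][0] + nodes[j][0], (nodes[i], nodes[j]))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

articles = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0], "d": [10.0, 11.0]}
root = build_cluster_tree(articles)
# root[1] holds the two sub-clusters a user could descend into from the top menu.
```

Moving from a node to its parent corresponds to navigating to a larger, more general group, and moving to a child corresponds to navigating to a smaller, more specific group.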

There are a number of variations on the theme of devel- 
oping and using profiles for article retrieval. Variations of 
this basic system are disclosed and comprise a system to 
filter electronic mail, an extension for retrieval of target 
objects such as purchasable items which may have more 
complex descriptions, a system to automatically build and 
alter menuing systems for browsing and searching through 
large numbers of target objects, and a system to construct 
virtual communities of people with common interests. These 
intelligent filters and browsers are necessary to provide a 
truly passive, intelligent system interface. A user interface 
that permits intuitive browsing and filtering represents for 
the first time an intelligent system for determining the 
affinities between users and target objects. The detailed, 
comprehensive target profiles and user-specific target profile 
interest summaries enable the system to provide responsive 
routing of specific queries for user information access. The 
information maps so produced and the application of users' 
target profile interest summaries to predict the information 
consumption patterns of a user allow for pre-caching of
data at locations on the data communication network and at 
times that minimize the traffic flow in the communication 
network to thereby efficiently provide the desired informa- 
tion to the user and/or conserve valuable storage space by 
only storing those target objects (or segments thereof) which 
are relevant to the user's interests. 

BRIEF DESCRIPTION OF THE DRAWING 

FIG. 1 illustrates in block diagram form a typical archi- 
tecture of an electronic media system in which the system 
for customized electronic identification of desirable objects 
of the present invention can be implemented as part of a user 
server system; 

FIG. 2 illustrates in block diagram form one embodiment 
of the system for customized electronic identification of 
desirable objects; 

FIGS. 3 and 4 illustrate typical network trees; 

FIG. 5 illustrates in flow diagram form a method for 
automatically generating article profiles and an associated 
hierarchical menu system; 

FIGS. 6-9 illustrate examples of the menu generating
process;

FIG. 10 illustrates in flow diagram form the operational 
steps taken by the system for customized electronic identi- 
fication of desirable objects to screen articles for a user; 
FIG. 11 illustrates a hierarchical cluster tree example;

FIG. 12 illustrates in flow diagram form the process for 
determination of likelihood of interest by a specific user in 
a selected target object; 

FIGS. 13A-B illustrate in flow diagram form the auto- 
matic clustering process; 

FIG. 14 illustrates in flow diagram form the use of the 
pseudonymous server; 

FIG. 15 illustrates in flow diagram form the use of the 
system for accessing information in response to a user 
query; and 

FIG. 16 illustrates in flow diagram form the use of the 
system for accessing information in response to a user query 
when the system is a distributed network implementation. 

DETAILED DESCRIPTION 

MEASURING SIMILARITY 

This section describes a general procedure for automati- 
cally measuring the similarity between two target objects, or, 
more precisely, between target profiles that are automatically 
generated for each of the two target objects. This similarity 
determination process is applicable to target objects in a 
wide variety of contexts. Target objects being compared can 
be, for example and without limitation, textual documents,
human beings, movies, or mutual funds. It is assumed that 
the target profiles which describe the target objects are 
stored at one or more locations in a data communication 
network on data storage media associated with a computer 
system. 

The computed similarity measurements serve as input to 
additional processes, which function to enable human users 
to locate desired target objects using a large computer 
system. These additional processes estimate a human user's 
interest in various target objects, or else cluster a plurality of 
target objects into logically coherent groups. The methods
used by these additional processes might in principle be 
implemented on either a single computer or on a computer 
network. Jointly or separately, they form the underpinning 
for various sorts of database systems and information 
retrieval systems. 
Target Objects and Attributes 

In classical Information Retrieval (IR) technology, the 
user is a literate human and the target objects in question are 
textual documents stored on data storage devices intercon- 
nected to the user via a computer network. That is, the target 
objects consist entirely of text, and so are digitally stored on 
the data storage devices within the computer network. 
However, there are other target object domains that present
related retrieval problems which present information retrieval
technology is not capable of solving, and which are
applicable to targeting of articles and advertisements to
readers of an on-line newspaper:

(a.) the user is a film buff and the target objects are movies
     available on videotape,
(b.) the user is a consumer and the target objects are used
     cars being sold,
(c.) the user is a consumer and the target objects are
     products being sold through promotional deals,
(d.) the user is an investor and the target objects are
     publicly traded stocks, mutual funds and/or real estate
     properties,
(e.) the user is a student and the target objects are classes
     being offered,
(f.) the user is an activist and the target objects are
     Congressional bills of potential concern,
(g.) the user is about to send an e-mail message and the
     target objects are potential recipients who are interested
     in the content of that message,
(h.) the user is a corporate receptionist receiving incoming
     e-mail, voice mail or live telephone calls and the target
     objects are the employees who are the most qualified
     to handle those incoming media,
(i.) the user is a net-surfer and the target objects are links
     to pages, servers, or newsgroups available on the World
     Wide Web which are linked from pages and articles in
     the on-line newspaper,
(j.) the user is a philanthropist and the target objects are
     charities,
(k.) the user is ill and the target objects are ads for medical
     specialists,
(l.) the user is an employee and the target objects are
     classifieds for potential employers,
(m.) the user is an employer and the target objects are
     classifieds for potential employees,
(n.) the user is a lonely heart and the target objects are
     classifieds for potential conversation partners,
(o.) the user is in search of an expert and the target objects
     are users, with known retrieval habits, of a document
     retrieval system,
(p.) the user is in need of insurance and the target objects
     are classifieds for insurance policy offers.
In all these cases, the user wishes to locate some small
subset of the target objects — such as the target objects that
the user most desires to rent, buy, investigate, meet, read,
give mammograms to, insure, and so forth. The task is to
help the user identify the most interesting target objects,
where the user's interest in a target object is defined to be a
numerical measurement of the user's relative desire to locate
that object rather than others.

The generality of this problem motivates a general
approach to solving the information retrieval problems noted
above. It is assumed that many target objects are known to
the system for customized electronic identification of
desirable objects, and that specifically, the system stores (or
has the ability to reconstruct) several pieces of information
about each target object. These pieces of information are
termed "attributes"; collectively, they are said to form a
profile of the target object, or a "target profile." For example,
where the system for customized electronic identification of
desirable objects is activated to identify selections of interest
in a particular category of on-line products for review or
purchase by the user, it can be appreciated that there are
certain unique sets of attributes which are pertinent to the
particular product category of choice. For the application as
part of a movie critic column (where the system identifies
novel titles and reviews which are most interesting to the
user) the system is likely to be concerned with the values of
attributes such as these:

(a.) title of movie,
(b.) name of director,
(c.) Motion Picture Association of America (MPAA)
     child-appropriateness rating (0=G, 1=PG, . . . ),
(d.) date of release,
(e.) number of stars granted by a particular critic,
(f.) number of stars granted by a second critic,
(g.) number of stars granted by a third critic,


(h.) full text of review by the third critic,
(i.) list of customers who have previously rented this
     movie,
(j.) list of actors.
Each movie has a different set of values for these
attributes. This example conveniently illustrates three kinds
of attributes. Attributes c-g are numeric attributes, of the
sort that might be found in a database record. It is evident
that they can be used to help the user identify target objects
(movies) of interest. For example, the user might previously
have rented many Parental Guidance (PG) films, and many
films made in the 1970's. This generalization is useful: new
films with values for one or both attributes that are
numerically similar to these (such as MPAA rating of 1,
release date of 1975) are judged similar to the films the user
already likes, and therefore of probable interest. Attributes
a-b and h are textual attributes. They too are important for
helping the user locate desired films. For example, perhaps
the user has shown a past interest in films whose review text
(attribute h) contains words like "chase," "explosion,"
"explosions," "hero," "gripping," and "superb." This
generalization is again useful in identifying new films of
interest. Attribute i is an associative attribute. It records
associations between the target objects in this domain,
namely movies, and ancillary target objects of an entirely
different sort, namely humans. A good indication that the
user wants to rent a particular movie is that the user has
previously rented other movies with similar attribute values,
and this holds for attribute i just as it does for attributes a-h.
For example, if the user has often liked movies that customer
C17 and customer C190 have rented, then the user may like
other such movies, which have similar values for attribute i.
Attribute j is another example of an associative attribute,
recording associations between target objects and actors.
Notice that any of these attributes can be made subject to
authentication when the profile is constructed, through the
use of digital signatures; for example, the target object could
be accompanied by a digitally signed note from the MPAA,
which note names the target object and specifies its authentic
value for attribute c.
These three kinds of attributes are common: numeric,
textual, and associative. In the classical information retrieval
problem, where the target objects are documents (or more
generally, coherent document sections extracted by a text
segmentation method), the system might consider only a
single, textual attribute when measuring similarity: the full
text of the target object. However, a more sophisticated
system would consider a longer target profile, including
numeric and associative attributes:
(a.) full text of document (textual),
(b.) title (textual),
(c.) author (textual),
(d.) language in which document is written (textual),
(e.) date of creation (numeric),
(f.) date of last update (numeric),
(g.) length in words (numeric),
(h.) reading level (numeric),
(i.) quality of document as rated by a third party editorial
     agency (numeric),
(j.) list of other readers who have retrieved this document
     (associative).

As another domain example, consider a domain where the
user is an advertiser and the target objects are potential
customers. The system might store the following attributes
for each target object (potential customer):
(a.) first two digits of zip code (textual),
(b.) first three digits of zip code (textual),
(c.) entire five-digit zip code (textual),
(d.) distance of residence from advertiser's nearest
     physical storefront (numeric),
(e.) annual family income (numeric),
(f.) number of children (numeric),
(g.) list of previous items purchased by this potential
     customer (associative),
(h.) list of filenames stored on this potential customer's
     client computer (associative),
(i.) list of movies rented by this potential customer
     (associative),
(j.) list of investments in this potential customer's
     investment portfolio (associative),
(k.) list of documents retrieved by this potential customer
     (associative),
(l.) written response to Rorschach inkblot test (textual),
(m.) multiple-choice responses by this customer to 20
     self-image questions (20 textual attributes).
As always, the notion is that similar consumers buy
similar products. It should be noted that diverse sorts of
information are being used here to characterize consumers,
from their consumption patterns to their literary tastes and
psychological peculiarities, and that this fact illustrates both
the flexibility and power of the system for customized
electronic identification of desirable objects of the present
invention. Diverse sorts of information can be used as
attributes in other domains as well (as when physical,
economic, psychological and interest-related questions are
used to profile the applicants to a dating service, which is
indeed a possible domain for the present system), and the
advertiser domain is simply an example.

As a final domain example, consider a domain where the
user is a stock market investor and the target objects are
publicly traded corporations. A great many attributes might
be used to characterize each corporation, including but not
limited to the following:
(a.) type of business (textual),
(b.) corporate mission statement (textual),
(c.) number of employees during each of the last 10 years
     (ten separate numeric attributes),
(d.) percentage growth in number of employees during
     each of the last 10 years,
(e.) dividend payment issued in each of the last 40
     quarters, as a percentage of current share price,
(f.) percentage appreciation of stock value during each of
     the last 40 quarters,
(g.) list of shareholders (associative),
(h.) composite text of recent articles about the corporation
     in the financial press (textual).
For example, a customized financial news column may be
presented to the user in the form of articles which are of
interest to the user. In addition, those stocks which are most
interesting to the user may be presented as well.

It is worth noting some additional attributes that are of
interest in some domains. In the case of documents and
certain other domains, it is useful to know the source of each
target object (for example, refereed journal article vs. UPI
newswire article vs. Usenet newsgroup posting vs. question-
answer pair from a question-and-answer list vs. tabloid
newspaper article vs. . . . ); the source may be represented
as a single-term textual attribute. Important associative
attributes for a hypertext document are the list of documents 
that it links to, and the list of documents that link to it. 
Documents with similar citations are similar with respect to 
the former attribute, and documents that are cited in the 
same places are similar with respect to the latter. A conven- 
tion may optionally be adopted that any document also links 
to itself. Especially in systems where users can choose 
whether or not to retrieve a target object, a target object's 
popularity (or circulation) can be usefully measured as a 
numeric attribute specifying the number of users who have 
retrieved that object. Related measurable numeric attributes 
that also indicate a kind of popularity include the number of 
replies to a target object, in the domain where target objects 
are messages posted to an electronic community such as a
computer bulletin board or newsgroup, and the number of
links leading to a target object, in the domain where target 
objects are interlinked hypertext documents on the World 
Wide Web or a similar system. A target object may also 
receive explicit numeric evaluations (another kind of 
numeric attribute) from various groups, such as the Motion 
Picture Association of America (MPAA), as above, which 
rates movies' appropriateness for children, or the American
Medical Association, which might rate the accuracy and 
novelty of medical research papers, or a random survey 
sample of users (chosen from all users or a selected set of 
experts), who could be asked to rate nearly anything. Certain 
other types of evaluation, which also yield numeric 
attributes, may be carried out mechanically. For example, 
the difficulty of reading a text can be assessed by standard 
procedures that count word and sentence lengths, while the 
vulgarity of a text could be defined as (say) the number of 
vulgar words it contains, and the expertise of a text could be 
crudely assessed by counting the number of similar texts its 
author had previously retrieved and read using the invention, 
perhaps confining this count to texts that have high approval 
ratings from critics. Finally, it is possible to synthesize 
certain textual attributes mechanically, for example to recon- 
struct the script of a movie by applying speech recognition 
techniques to its soundtrack or by applying optical character 
recognition techniques to its closed-caption subtitles. 
Decomposing Complex Attributes 

Although textual and associative attributes are large and 
complex pieces of data, for information retrieval purposes 
they can be decomposed into smaller, simpler numeric 
attributes. This means that any set of attributes can be 
replaced by a (usually larger) set of numeric attributes, and 
hence that any profile can be represented as a vector of 
numbers denoting the values of these numeric attributes. In 
particular, a textual attribute, such as the full text of a movie 
review, can be replaced by a collection of numeric attributes 
that represent scores to denote the presence and significance 
of the words "aardvark," "aback," "abacus," and so on 
through "zymurgy" in that text. The score of a word in a text 
may be defined in numerous ways. The simplest definition is 
that the score is the rate of the word in the text, which is
computed by counting the number of times the word
occurs in the text, and dividing this number by the total
number of words in the text. This sort of score is often called 
the "term frequency" (TF) of the word. The definition of 
term frequency may optionally be modified to weight dif- 
ferent portions of the text unequally: for example, any 
occurrence of a word in the text's title might be counted as 
a 3-fold or more generally k-fold occurrence (as if the title 
had been repeated k times within the text), in order to reflect 
a heuristic assumption that the words in the title are par- 
ticularly important indicators of the text's content or topic. 




However, for lengthy textual attributes, such as the text of
an entire document, the score of a word is typically defined
to be not merely its term frequency, but its term frequency
multiplied by the negated logarithm of the word's "global
frequency," as measured with respect to the textual attribute
in question. The global frequency of a word, which
effectively measures the word's uninformativeness, is a
fraction between 0 and 1, defined to be the fraction of all
target objects for which the textual attribute in question
contains this word. This adjusted score is often known in the
art as TF/IDF ("term frequency times inverse document
frequency"). When global frequency of a word is taken into
account in this way, the common, uninformative words have
scores comparatively close to zero, no matter how often or
rarely they appear in the text. Thus, their rate has little
influence on the object's target profile. Alternative methods
of calculating word scores include latent semantic indexing
or probabilistic models.
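
The word scoring described above can be sketched as follows. This is
a minimal illustration under stated assumptions, not the claimed
implementation: texts are assumed to be already split into lists of
lowercase words, the function names are hypothetical, and the title
weighting simply counts each title word k times.

```python
import math
from collections import Counter


def term_frequencies(body_words, title_words=(), k=1):
    """Rate of each word in the text; each title word counts k times."""
    counts = Counter(body_words)
    for word in title_words:
        counts[word] += k
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def global_frequencies(texts):
    """Fraction of all texts (each a list of words) containing each word."""
    containing = Counter()
    for words in texts:
        containing.update(set(words))
    return {word: n / len(texts) for word, n in containing.items()}


def tfidf_scores(words, global_freq):
    """Term frequency times the negated logarithm of global frequency."""
    tf = term_frequencies(words)
    return {w: rate * -math.log(global_freq[w]) for w, rate in tf.items()}


texts = [["the", "chase", "the", "hero"], ["the", "quiet", "garden"]]
scores = tfidf_scores(texts[0], global_frequencies(texts))
```

In this toy collection "the" appears in every text, so its global
frequency is 1 and its score is zero however often it occurs, while
"chase" keeps a positive score.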

Instead of breaking the text into its component words, one
could alternatively break the text into overlapping word
bigrams (sequences of 2 adjacent words), or more generally,
word n-grams. These word n-grams may be scored in the
same way as individual words. Another possibility is to use
character n-grams. For example, this sentence contains a
sequence of overlapping character 5-grams which starts
"for e", "or ex", "r exa", " exam", "examp", etc. The sentence
may be characterized, imprecisely but usefully, by the score
of each possible character 5-gram ("aaaaa", "aaaab", . . .
"zzzzz") in the sentence. Conceptually speaking, in the
character 5-gram case, the textual attribute would be
decomposed into at least 26^5 = 11,881,376 numeric
attributes. Of course, for a given target object, most of these
numeric attributes have values of 0, since most 5-grams do
not appear in the target object attributes. These zero values
need not be stored anywhere. For purposes of digital storage,
the value of a textual attribute could be characterized by
storing the set of character 5-grams that actually do appear in
the text, together with the nonzero score of each one. Any
5-gram that is not included in the set can be assumed to have
a score of zero.

The decomposition of textual attributes is not
limited to attributes whose values are expected to be long
texts. A simple, one-term textual attribute can be replaced by
a collection of numeric attributes in exactly the same way.
Consider again the case where the target objects are movies.
The "name of director" attribute, which is textual, can be
replaced by numeric attributes giving the scores for
"Federico-Fellini," "Woody-Allen," "Terence-Davies," and
so forth, in that attribute. For these one-term textual
attributes, the score of a word is usually defined to be its rate
in the text, without any consideration of global frequency.
Note that under these conditions, one of the scores is 1,
while the other scores are 0 and need not be stored. For
example, if Davies did direct the film, then it is
"Terence-Davies" whose score is 1, since "Terence-Davies"
constitutes 100% of the words in the textual value of the
"name of director" attribute. It might seem that nothing has
been gained over simply regarding the textual attribute as
having the string value "Terence-Davies." However, the trick
of decomposing every non-numeric attribute into a collection
of numeric attributes proves useful for the clustering and
decision tree methods described later, which require the
attribute values of different objects to be averaged and/or
ordinally ranked. Only numeric attributes can be averaged or
ranked in this way.

Just as a textual attribute may be
decomposed into a number of component terms (letter or
word n-grams), an associative attribute may be decomposed
into a number of component associations. For instance, in a
domain where the target objects are movies, a typical
associative attribute used in profiling a movie would be a list
of customers who have rented that movie. This list can be
replaced by a collection of numeric attributes, which give
the "association scores" between the movie and each of the
customers known to the system. For example, the 165th such
numeric attribute would be the association score between the
movie and customer #165, where the association score is
defined to be 1 if customer #165 has previously rented the
movie, and 0 otherwise. In a subtler refinement, this
association score could be defined to be the degree of
interest, possibly zero, that customer #165 exhibited in the
movie, as determined by relevance feedback (as described
below). As another example, in a domain where target
objects are companies, an associative attribute indicating the
major shareholders of the company would be decomposed
into a collection of association scores, each of which would
indicate the percentage of the company (possibly zero)
owned by some particular individual or corporate body. Just
as with the term scores used in decomposing lengthy textual
attributes, each association score may optionally be adjusted
by a multiplicative factor: for example, the association score
between a movie and customer #165 might be multiplied by
the negated logarithm of the "global frequency" of customer
#165, i.e., the fraction of all movies that have been rented by
customer #165. Just as with the term scores used in
decomposing textual attributes, most association scores
found when decomposing a particular value of an associative
attribute are zero, and a similar economy of storage may be
gained in exactly the same manner by storing a list of only
those ancillary objects with which the target object has a
nonzero association score, together with their respective
association scores.
Similarity Measures

What does it mean for two target objects to be similar?
More precisely, how should one measure the degree of
similarity? Many approaches are possible and any reasonable
metric that can be computed over the set of target object
profiles can be used, where target objects are considered to
be similar if the distance between their profiles is small
according to this metric. Thus, the following preferred
embodiment of a target object similarity measurement
system has many variations.

First, define the distance between two values of a given
attribute according to whether the attribute is a numeric,
associative, or textual attribute. If the attribute is numeric,
then the distance between two values of the attribute is the
absolute value of the difference between the two values.
(Other definitions are also possible: for example, the
distance between prices \( p_1 \) and \( p_2 \) might be defined by
\( |p_1 - p_2| / (\max(p_1, p_2) + 1) \), to recognize that when it
comes to customer interest, $5000 and $5020 are very
similar, whereas $3 and $23 are not.) If the attribute is
associative, then its value V may be decomposed as described
above into a collection of real numbers, representing the
association scores between the target object in question and
various ancillary objects. V may therefore be regarded as a
vector with components \( V_1, V_2, V_3 \), etc., representing the
association scores between the object and ancillary objects
1, 2, 3, etc., respectively. The distance between two vector
values V and U of an associative attribute is then computed
using the angle distance measure,
\( \arccos\bigl( VU' / \sqrt{(VV')(UU')} \bigr) \). (Note that the three
inner products in this expression have the form
\( XY' = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + \cdots \), and that for
efficient computation, terms of the form \( X_i Y_i \) may be
omitted from this sum if either of the scores \( X_i \) and \( Y_i \) is
zero.) Finally, if the attribute is textual, then its value V may
be decomposed as described above into a collection of real
numbers, representing the scores of various word n-grams or
character n-grams in the text. Then the value V may again be
regarded as a vector, and the distance between two values is
again defined via the angle distance measure. Other
similarity metrics between two vectors, such as the dice
measure, may be used instead. It happens that the obvious
alternative metric, Euclidean distance, does not work well:
even similar texts tend not to overlap substantially in the
content words they use, so that texts encountered in practice
are all substantially orthogonal to each other, assuming that
TF/IDF scores are used to reduce the influence of
non-content words.

The scores of two words in a textual attribute vector may be
correlated; for example, "Kennedy" and "JFK" tend to
appear in the same documents. Thus it may be advisable to
alter the text somewhat before computing the scores of terms
in the text, by using a synonym dictionary that groups
together similar words. The effect of this optional
pre-alteration is that two texts using related words are
measured to be as similar as if they had actually used the
same words. One technique is to augment the set of words
actually found in the article with a set of synonyms or other
words which tend to co-occur with the words in the article,
so that "Kennedy" could be added to every article that
mentions "JFK." Alternatively, words found in the article
may be wholly replaced by synonyms, so that "JFK" might
be replaced by "Kennedy" or by "John F. Kennedy" wherever
it appears. In either case, the result is that documents about
Kennedy and documents about JFK are adjudged similar.
The synonym dictionary may be sensitive to the topic of the
document as a whole; for example, it may recognize that
"crane" is likely to have a different synonym in a document
that mentions birds than in a document that mentions
construction. A related technique is to replace each word by
its morphological stem, so that "staple", "stapler", and
"staples" are all replaced by "staple." Common function
words ("a", "and", "the" . . . ) can influence the calculated
similarity of texts without regard to their topics, and so are
typically removed from the text before the scores of terms in
the text are computed. A more general approach to
recognizing synonyms is to use a revised measure of the
distance between textual attribute vectors V and U, namely
\( \arccos\bigl( AV(AU)' / \sqrt{AV(AV)' \, AU(AU)'} \bigr) \), where the
matrix A is the dimensionality-reducing linear transformation
(or an approximation thereto) determined by collecting the
vector values of the textual attribute, for all target objects
known to the system, and applying singular value
decomposition to the resulting collection. The same approach
can be applied to the vector values of associative attributes.

The above definitions allow us to determine how close
together two target objects are with respect to a single
attribute, whether numeric, associative, or textual. The
distance between two target objects X and Y with respect to
their entire multi-attribute profiles \( P_X \) and \( P_Y \) is then
denoted d(X,Y) or \( d(P_X, P_Y) \) and defined as:

\[
\Bigl( \bigl((\text{distance with respect to attribute } a)(\text{weight of attribute } a)\bigr)^{k}
     + \bigl((\text{distance with respect to attribute } b)(\text{weight of attribute } b)\bigr)^{k}
     + \bigl((\text{distance with respect to attribute } c)(\text{weight of attribute } c)\bigr)^{k}
     + \cdots \Bigr)^{1/k}
\]

where k is a fixed positive real number, typically 2, and the
weights are non-negative real numbers indicating the relative
importance of the various attributes. For example, if the
target objects are consumer goods, and the weight of the
"color" attribute is comparatively very small, then color is
not a consideration in determining similarity: a user who
likes a brown massage cushion is predicted to show equal
interest in the same cushion manufactured in blue, and
vice-versa. On the other hand, if the weight of the "color"
attribute is comparatively very high, then users are predicted
to show interest primarily in products whose colors they
have liked in the past: a brown massage cushion and a blue
massage cushion are not at all the same kind of target object,
however similar in other attributes, and a good experience
with one does not by itself inspire much interest in the other.
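
The sparse storage of character n-gram scores and the angle distance
measure described above can be sketched together as follows. This
is an illustration under stated assumptions: scores are plain rates
(no global-frequency adjustment), vectors are Python dictionaries
holding only nonzero scores, angles are reported in degrees, and the
function names are hypothetical.

```python
import math
from collections import Counter


def char_ngram_scores(text, n=5):
    """Sparse score vector: rate of each character n-gram that occurs."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def inner(x, y):
    """Inner product of two sparse vectors; zero terms are never visited."""
    if len(x) > len(y):
        x, y = y, x
    return sum(score * y[term] for term, score in x.items() if term in y)


def angle_distance(v, u):
    """Angle in degrees between two sparse, nonzero score vectors."""
    norm = math.sqrt(inner(v, v) * inner(u, u))
    if norm == 0:
        raise ValueError("angle is undefined for an all-zero vector")
    cosine = inner(v, u) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


v = char_ngram_scores("for example")
same = angle_distance(v, char_ngram_scores("for example"))
disjoint = angle_distance(v, char_ngram_scores("zymurgy"))
```

Two texts with no 5-gram in common are orthogonal (90 degrees apart),
and a text compared with itself is at distance zero, up to
floating-point rounding.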
Target objects may be of various sorts, and it is sometimes 
advantageous to use a single system that is able to compare 
target objects of distinct sorts. For example, in a system
where some target objects are novels while other target 
objects are movies, it is desirable to judge a novel and a 
movie similar if their profiles show that similar users like 
them (an associative attribute). However, it is important to 
note that certain attributes specified in the movie's target 
profile are undefined in the novel's target profile, and vice 
versa: a novel has no "cast list" associative attribute and a 
movie has no "reading level" numeric attribute. In general,
a system in which target objects fall into distinct sorts may 
sometimes have to measure the similarity of two target 
objects for which somewhat different sets of attributes are 
defined. This requires an extension to the distance metric 
d(*,*) defined above. In certain applications, it is sufficient 
when carrying out such a comparison simply to disregard 
attributes that are not defined for both target objects: this 
allows a cluster of novels to be matched with the most 
similar cluster of movies, for example, by considering only 
those attributes that novels and movies have in common. 
However, while this method allows comparisons between 
(say) novels and movies, it does not define a proper metric 
over the combined space of novels and movies and therefore
does not allow clustering to be applied to the set of all target 
objects. When necessary for clustering or other purposes, a 
metric that allows comparison of any two target objects 
(whether of the same or different sorts) can be defined as 
follows. If a is an attribute, then let Max(a) be an upper 
bound on the distance between two values of attribute a; 
notice that if attribute a is an associative or textual attribute, 
this distance is an angle determined by arccos, so that 
Max(a) may be chosen to be 180 degrees, while if attribute 
a is a numeric attribute, a sufficiently large number must be 
selected by the system designers. The distance between two 
values of attribute a is given as before in the case where both 
values are defined; the distance between two undefined 
values is taken to be zero; finally, the distance between a 
defined value and an undefined value is always taken to be 
Max(a)/2. This allows us to determine how close together 
two target objects are with respect to an attribute a, even if 
attribute a does not have a defined value for both target 
objects. The distance d(*,*) between two target objects with
respect to their entire multi-attribute profiles is then given in 
terms of these individual attribute distances exactly as 
before. It is assumed that one attribute in such a system 
specifies the sort of target object ("movie", "novel", etc.), 
and that this attribute may be highly weighted if target 
objects of different sorts are considered to be very different 
despite any attributes they may have in common. 
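
The extended metric can be sketched as follows. The sketch rests on
several assumptions that go beyond the text: a profile is a Python
dictionary in which a missing key means an undefined attribute, each
attribute is described by a hypothetical (distance function, weight,
Max(a)) triple, and the weighted k-th powers are combined by taking
the k-th root of their sum. The alternative price distance mentioned
earlier is included for comparison.

```python
def absolute_difference(x, y):
    """Default distance between two numeric attribute values."""
    return abs(x - y)


def price_distance(p1, p2):
    """Relative price distance: |p1 - p2| / (max(p1, p2) + 1)."""
    return abs(p1 - p2) / (max(p1, p2) + 1)


def attribute_distance(x, y, dist, max_dist):
    """Distance for one attribute, where None marks an undefined value."""
    if x is None and y is None:
        return 0.0
    if x is None or y is None:
        return max_dist / 2
    return dist(x, y)


def profile_distance(px, py, attributes, k=2):
    """Weighted combination of per-attribute distances (k-th root assumed)."""
    total = 0.0
    for name, (dist, weight, max_dist) in attributes.items():
        d = attribute_distance(px.get(name), py.get(name), dist, max_dist)
        total += (d * weight) ** k
    return total ** (1.0 / k)


# Hypothetical attribute table: name -> (distance, weight, Max(a)).
attributes = {
    "price": (absolute_difference, 1.0, 100.0),
    "reading_level": (absolute_difference, 0.5, 16.0),
}
novel = {"price": 10.0, "reading_level": 8.0}
movie = {"price": 13.0}  # "reading_level" is undefined for a movie
d = profile_distance(novel, movie, attributes)
```

Here the price term contributes 3 and the undefined reading level
contributes (16/2)(0.5) = 4, so the combined distance with k = 2 is 5.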

UTILIZING THE SIMILARITY MEASUREMENT 
Matching Buyers and Sellers 

A simple application of the similarity measurement is a 
system to match buyers with sellers in small-volume 
markets, such as used cars and other used goods, artwork, or 
employment. Sellers submit profiles of the goods (target 
objects) they want to sell, and buyers submit profiles of the 
goods (target objects) they want to buy. Participants may 
submit or withdraw these profiles at any time. The system 
for customized electronic identification of desirable objects 
computes the similarities between seller-submitted profiles
and buyer-submitted profiles, and when two profiles match
closely (i.e., the similarity is above a threshold), the
corresponding seller and buyer are notified of each other's
identities. To prevent users from being flooded with
responses, it may be desirable to limit the number of
notifications each user receives to a fixed number, such as
ten per day.
Filtering: Relevance Feedback 

A filtering system is a device that can search through
many target objects and estimate a given user's interest in
each target object, so as to identify those that are of greatest
interest to the user. The filtering system uses relevance
feedback to refine its knowledge of the user's interests:
whenever the filtering system identifies a target object as
potentially interesting to a user, the user (if an on-line user)
provides feedback as to whether or not that target object
really is of interest. Such feedback is stored long-term in
summarized form, as part of a database of user feedback
information, and may be provided either actively or
passively. In active feedback, the user explicitly indicates his or
her interest, for instance, on a scale of -2 (active distaste)
through 0 (no special interest) to 10 (great interest). In
passive feedback, the system infers the user's interest from
the user's behavior. For example, if target objects are textual
documents, the system might monitor which documents the
user chooses to read, or not to read, and how much time the
user spends reading them. A typical formula for assessing
interest in a document via passive feedback, in this domain,
on a scale of 0 to 10, might be:

+2 if the second page is viewed,
+2 if all pages are viewed,
+2 if more than 30 seconds was spent viewing the document,
+2 if more than one minute was spent viewing the document,
+2 if the minutes spent viewing the document are greater
than half the number of pages.
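
The formula above might be sketched as follows. The function and parameter names are illustrative assumptions, and the sketch further assumes that pages are viewed in order, so that "the second page is viewed" is equivalent to at least two pages having been viewed.

```python
def passive_feedback_score(pages_viewed: int, total_pages: int, seconds_viewed: float) -> int:
    """Passive interest score on a scale of 0 to 10 for a textual document."""
    score = 0
    if pages_viewed >= 2:                                  # second page viewed
        score += 2
    if total_pages > 0 and pages_viewed >= total_pages:    # all pages viewed
        score += 2
    if seconds_viewed > 30:                                # more than 30 seconds
        score += 2
    if seconds_viewed > 60:                                # more than one minute
        score += 2
    if seconds_viewed / 60.0 > total_pages / 2.0:          # minutes > half the page count
        score += 2
    return score
```

A four-page document read in full over two and a half minutes scores 10; the same document abandoned on its first page after 20 seconds scores 0.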

If the target objects are electronic mail messages, interest
points might also be added in the case of a particularly
lengthy or particularly prompt reply. If the target objects are
purchasable goods, interest points might be added for target
objects that the user actually purchases, with further points
in the case of a large-quantity or high-price purchase. In any
domain, further points might be added for target objects that
the user accesses early in a session, on the grounds that users
access the objects that most interest them first. Other
potential sources of passive feedback include an electronic
measurement of the extent to which the user's pupils dilate while
the user views the target object or a description of the target
object. It is possible to combine active and passive feedback.
One option is to take a weighted average of the two ratings.
Another option is to use passive feedback by default, but to
allow the user to examine and actively modify the passive
feedback score. In the scenario above, for instance, an
uninteresting article may sometimes remain on the display
device for a long period while the user is engaged in
unrelated business; the passive feedback score is then
inappropriately high, and the user may wish to correct it before
continuing. In the preferred embodiment of the invention, a
visual indicator, such as a sliding bar or indicator needle on
the user's screen, can be used to continuously display the
passive feedback score estimated by the system for the target
object being viewed, unless the user has manually adjusted
the indicator by a mouse operation or other means in order
to reflect a different score for this target object, after which
the indicator displays the active feedback score selected by
the user, and this active feedback score is used by the system
instead of the passive feedback score. In a variation, the user
cannot see or adjust the indicator until just after the user has
finished viewing the target object. Regardless of how a user's
feedback is computed, it is stored long-term as part of that
user's target profile interest summary.

Filtering: Determining Topical Interest Through Similarity

Relevance feedback only determines the user's interest in
certain target objects: namely, the target objects that the user
has actually had the opportunity to evaluate (whether
actively or passively). For target objects that the user has not
yet seen, the filtering system must estimate the user's
interest. This estimation task is the heart of the filtering
problem, and the reason that the similarity measurement is
important. More concretely, the preferred embodiment of the
filtering system is a news clipping service that periodically
presents the user with news articles of potential interest. The
user provides active and/or passive feedback to the system
relating to these presented articles. However, the system
does not have feedback information from the user for
articles that have never been presented to the user, such as
new articles that have just been added to the database, or old
articles that the system chose not to present to the user.
Similarly, in the dating service domain where target objects
are prospective romantic partners, the system has only
received feedback on old flames, not on prospective new
loves.

As shown in flow diagram form in FIG. 12, the evaluation
of the likelihood of interest in a particular target object for
a specific user can automatically be computed. The interest
that a given target object X holds for a user U is assumed to
be a sum of two quantities: q(U, X), the intrinsic "quality"
of X, plus f(U, X), the "topical interest" that users like U
have in target objects like X. For any target object X, the
intrinsic quality measure q(U, X) is easily estimated at steps
1201-1203 directly from numeric attributes of the target
object X. The computation process begins at step 1201,
where certain designated numeric attributes of target object
X are specifically selected, which attributes by their very
nature should be positively or negatively correlated with
users' interest. Such attributes, termed "quality attributes,"
have the normative property that the higher (or in some cases
lower) their value, the more interesting a user is expected to
find them. Quality attributes of target object X may include,
but are not limited to, target object X's popularity among
users in general, the rating a particular reviewer has given
target object X, the age (time since authorship, also known
as outdatedness) of target object X, the number of vulgar
words used in target object X, the price of target object X,
and the amount of money that the company selling target
object X has donated to the user's favorite charity. At step
1202, each of the selected attributes is multiplied by a
positive or negative weight indicative of the strength of user
U's preference for those target objects that have high values
for this attribute, which weight must be retrieved from a data
file storing quality attribute weights for the selected user. At
step 1203, a weighted sum of the identified weighted
selected attributes is computed to determine the intrinsic
quality measure q(U, X). At step 1204, the summarized
weighted relevance feedback data is retrieved, wherein some
relevance feedback points are weighted more heavily than
others and the stored relevance data can be summarized to
some degree, for example by the use of search profile sets.
The more difficult part of determining user U's interest in
target object X is to find or compute at step 1205 the value
of f(U, X), which denotes the topical interest that users like
U generally have in target objects like X. The method of
determining a user's interest relies on the following
heuristic: when X and Y are similar target objects (have similar
attributes), and U and V are similar users (have similar
attributes), then topical interest f(U, X) is predicted to have
a similar value to the value of topical interest f(V, Y). This
heuristic leads to an effective method because estimated
values of the topical interest function f(*, *) are actually
known for certain arguments to that function: specifically,
if user V has provided a relevance-feedback rating of r(V, Y)
for target object Y, then insofar as that rating represents user
V's true interest in target object Y, we have r(V, Y)=q(V,
Y)+f(V, Y) and can estimate f(V, Y) as r(V, Y)-q(V, Y).
Thus, the problem of estimating topical interest at all points
becomes a problem of interpolating among these estimates
of topical interest at selected points, such as the feedback
estimate of f(V, Y) as r(V, Y)-q(V, Y). This interpolation can
be accomplished with any standard smoothing technique,
using as input the known point estimates of the value of the
topical interest function f(*, *), and determining as output a
function that approximates the entire topical interest
function f(*, *).

Not all point estimates of the topical interest function f(*,
*) should be given equal weight as inputs to the smoothing
algorithm. Since passive relevance feedback is less reliable
than active relevance feedback, point estimates made from
passive relevance feedback should be weighted less heavily
than point estimates made from active relevance feedback,
or even not used at all. In most domains, a user's interests
may change over time and, therefore, estimates of topical
interest that derive from more recent feedback should also
be weighted more heavily. A user's interests may vary
according to mood, so estimates of topical interest that
derive from the current session should be weighted more
heavily for the duration of the current session, and past
estimates of topical interest made at approximately the
current time of day or on the current weekday should be
weighted more heavily. Finally, in domains where users are
trying to locate target objects of long-term interest
(investments, romantic partners, pen pals, employers,
employees, suppliers, service providers) from the possibly
meager information provided by the target profiles, the users
are usually not in a position to provide reliable immediate
feedback on a target object, but can provide reliable
feedback at a later date. An estimate of topical interest f(V, Y)
should be weighted more heavily if user V has had more
experience with target object Y. Indeed, a useful strategy is
for the system to track long-term feedback for such target
objects. For example, if target profile Y was created in 1990
to describe a particular investment that was available in
1990, and that was purchased in 1990 by user V, then the
system solicits relevance feedback from user V in the years
1990, 1991, 1992, 1993, 1994, 1995, etc., and treats these as
successively stronger indications of user V's true interest in
target profile Y, and thus as indications of user V's likely
interest in new investments whose current profiles resemble
the original 1990 investment profile Y. In particular, if in
1994 and 1995 user V is well-disposed toward his or her
1990 purchase of the investment described by target profile
Y, then in those years and later, the system tends to
recommend additional investments when they have profiles like
target profile Y, on the grounds that they too will turn out to
be satisfactory in 4 to 5 years. It makes these
recommendations both to user V and to users whose investment
portfolios and other attributes are similar to user V's. The
relevance feedback provided by user V in this case may be
either active (feedback=satisfaction ratings provided by the
investor V) or passive (feedback=difference between
average annual return of the investment and average annual
return of the Dow Jones index portfolio since purchase of the
investment, for example).

To effectively apply the smoothing technique, it is
necessary to have a definition of the similarity distance between
(U, X) and (V, Y), for any users U and V and any target
objects X and Y. We have already seen how to define the
distance d(X, Y) between two target objects X and Y, given
their attributes. We may regard a pair such as (U, X) as an
extended object that bears all the attributes of target X and
all the attributes of user U; then the distance between (U, X)
and (V, Y) may be computed in exactly the same way. This
approach requires user U, user V, and all other users to have
some attributes of their own stored in the system: for
example, age (numeric), social security number (textual),
and list of documents previously retrieved (associative). It is
these attributes that determine the notion of "similar users."
Thus it is desirable to generate profiles of users (termed
"user profiles") as well as profiles of target objects (termed
"target profiles"). Some attributes employed for profiling
users may be related to the attributes employed for profiling
target objects: for example, using associative attributes, it is
possible to characterize target objects such as X by the
interest that various users have shown in them, and
simultaneously to characterize users such as U by the interest that
they have shown in various target objects. In addition, user
profiles may make use of any attributes that are useful in
characterizing humans, such as those suggested in the
example domain above where target objects are potential
consumers. Notice that user U's interest can be estimated
even if user U is a new user or an off-line user who has never
provided any feedback, because the relevance feedback of
users whose attributes are similar to U's attributes is taken
into account.

For some uses of filtering systems, when estimating
topical interest, it is appropriate to make an additional
"presumption of no topical interest" (or "bias toward zero").
To understand the usefulness of such a presumption, suppose
the system needs to determine whether target object X is
topically interesting to the user U, but that users like user U
have never provided feedback on target objects even
remotely like target object X. The presumption of no topical
interest says that if this is so, it is because users like user U
are simply not interested in such target objects and therefore
do not seek them out and interact with them. On this
presumption, the system should estimate topical interest f(U,
X) to be low. Formally, this example has the characteristic
that (U, X) is far away from all the points (V, Y) where
feedback is available. In such a case, topical interest f(U, X)
is presumed to be close to zero, even if the value of the
topical interest function f(*, *) is high at all the faraway
surrounding points at which its value is known. When a
smoothing technique is used, such a presumption of no
topical interest can be introduced, if appropriate, by
manipulating the input to the smoothing technique. In addition to
using observed values of the topical interest function f(*, *)
as input, the trick is to also introduce fake observations of
the form topical interest f(V, Y)=0 for a lattice of points (V,
Y) distributed throughout the multidimensional space. These
fake observations should be given relatively low weight as
inputs to the smoothing algorithm. The more strongly they
are weighted, the stronger the presumption of no interest.

The following provides another simple example of an
estimation technique that has a presumption of no interest.
Let g be a decreasing function from non-negative real
numbers to non-negative real numbers, such as g(x)=e^(-x) or
g(x)=min(1, x^(-k)) where k>1. Estimate topical interest f(U,
X) with the following g-weighted average:

$$ f(U, X) = \frac{\sum \left( r(V, Y) - q(V, Y) \right) \cdot g(\mathrm{distance}((U, X), (V, Y)))}{\sum g(\mathrm{distance}((U, X), (V, Y)))} $$

Here the summations are over all pairs (V, Y) such that
user V has provided feedback r(V, Y) on target object Y, i.e.,
all pairs (V, Y) such that relevance feedback r(V, Y) is
defined. Note that both with this technique and with
conventional smoothing techniques, the estimate of the topical
interest f(U, X) is not necessarily equal to r(U, X)-q(U, X),
even when r(U, X) is defined.

Filtering: Adjusting Weights and Residue Feedback

The method described above requires the filtering system
to measure distances between (user, target object) pairs, such
as the distance between (U, X) and (V, Y). Given the means
described earlier for measuring the distance between two
multi-attribute profiles, the method must therefore associate
a weight with each attribute used in the profile of (user,
target object) pairs, that is, with each attribute used to profile
either users or target objects. These weights specify the
relative importance of the attributes in establishing
similarity or difference, and therefore, in determining how topical
interest is generalized from one (user, target object) pair to
another. Additional weights determine which attributes of a
target object contribute to the quality function q, and by how
much.

It is possible and often desirable for a filtering system to
store a different set of weights for each user. For example,
a user who thinks of two-star films as having materially
different topic and style from four-star films wants to assign
a high weight to "number of stars" for purposes of the
similarity distance measure d(*, *); this means that interest
in a two-star film does not necessarily signal interest in an
otherwise similar four-star film, or vice-versa. If the user
also agrees with the critics, and actually prefers four-star
films, the user also wants to assign "number of stars" a high
positive weight in the determination of the quality function
q. In the same way, a user who dislikes vulgarity wants to
assign the "vulgarity score" attribute a high negative weight
in the determination of the quality function q, although the
"vulgarity score" attribute does not necessarily have a high
weight in determining the topical similarity of two films.
Attribute weights (of both sorts) may be set or adjusted by
the system administrator or the individual user, on either a
temporary or a permanent basis.

However, it is often desirable for the filtering system to
learn attribute weights automatically, based on relevance
feedback. The optimal attribute weights for a user U are
those that allow the most accurate prediction of user U's
interests. That is, with the distance measure and quality
function defined by these attribute weights, user U's interest
in target object X, q(U, X)+f(U, X), can be accurately
estimated by the techniques above. The effectiveness of a
particular set of attribute weights for user U can therefore be
gauged by seeing how well it predicts user U's known
interests.

Formally, suppose that user U has previously provided
feedback on target objects X_1, X_2, X_3, . . . X_n, and that the
feedback ratings are r(U, X_1), r(U, X_2), r(U, X_3), . . . r(U,
X_n). Values of feedback ratings r(*, *) for other users and
other target objects may also be known. The system may use
the following procedure to gauge the effectiveness of the set
of attribute weights it currently stores for user U: (i) For
each 1<=i<=n, use the estimation techniques to estimate
q(U, X_i)+f(U, X_i) from all known values of feedback ratings
r. Call this estimate a_i. (ii) Repeat step (i), but this time make
the estimate for each 1<=i<=n without using the feedback
ratings r(U, X_j) as input, for any j such that the distance d(X_i,
X_j) is smaller than a fixed threshold. That is, estimate each
q(U, X_i)+f(U, X_i) from other values of feedback rating r
only; in particular, do not use r(U, X_i) itself. Call this
estimate b_i. The difference a_i-b_i is herein termed the
"residue feedback" r_res(U, X_i) of user U on target object X_i.
(iii) Compute user U's error measure,
(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2+ . . . +(a_n-b_n)^2.
A gradient-descent or other numerical optimization
method may be used to adjust user U's attribute weights so
that this error measure reaches a (local) minimum. This
approach tends to work best if the smoothing technique used
in estimation is such that the value of f(V, Y) is strongly
affected by the point estimate r(V, Y)-q(V, Y) when the latter
value is provided as input. Otherwise, the presence or
absence of the single input feedback rating r(U, X_i) in steps
(i)-(ii) may not make a_i and b_i very different from each
other. A slight variation of this learning technique adjusts a
single global set of attribute weights for all users, by
adjusting the weights so as to minimize not a particular
user's error measure but rather the total error measure of all
users. These global weights are used as a default initial
setting for a new user who has not yet provided any
feedback. Gradient descent can then be employed to adjust
this user's individual weights over time. Even when the
attribute weights are chosen to minimize the error measure
for user U, the error measure is generally still positive,
meaning that residue feedback from user U has not been
reduced to 0 on all target objects. It is useful to note that high
residue feedback from a user U on a target object X indicates
that user U liked target object X unexpectedly well given its
profile, that is, better than the smoothing model could
predict from user U's opinions on target objects with similar
profiles. Similarly, low residue feedback indicates that user
U liked target object X less than was expected. By definition,
this unexplained preference or dispreference cannot be the
result of topical similarity, and therefore must be regarded as
an indication of the intrinsic quality of target object X. It
follows that a useful quality attribute for a target object X is
the average amount of residue feedback r_res(V, X) from users
on that target object, averaged over all users V who have
provided relevance feedback on the target object. In a
variation of this idea, residue feedback is never averaged
indiscriminately over all users to form a new attribute, but
instead is smoothed to consider users' similarity to each
other. Recall that the quality measure q(U, X) depends on the
user U as well as the target object X, so that a given target
object X may be perceived by different users to have
different quality. In this variation, as before, q(U, X) is
calculated as a weighted sum of various quality attributes
that are dependent only on X, but then an additional term is
added, namely an estimate of r_res(U, X) found by applying
a smoothing algorithm to known values of r_res(V, X). Here
V ranges over all users who have provided relevance
feedback on target object X, and the smoothing algorithm is
sensitive to the distances d(U, V) from each such user V to
user U.
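
As a hedged illustration of this variation, the quality measure might be computed as a weighted sum of quality attributes plus a smoothed residue term. The function names, the dictionary representation, and the choice of an exp(-d) weighted average as the smoothing algorithm are assumptions of this sketch only.

```python
import math
from typing import Dict


def weighted_quality(weights: Dict[str, float], attributes: Dict[str, float]) -> float:
    """Weighted sum of quality attributes that depend only on target object X."""
    return sum(w * attributes.get(name, 0.0) for name, w in weights.items())


def smoothed_residue(residues: Dict[str, float], user_distance: Dict[str, float]) -> float:
    """Estimate r_res(U, X) from known r_res(V, X), weighting each user V by exp(-d(U, V))."""
    num = sum(res * math.exp(-user_distance[v]) for v, res in residues.items())
    den = sum(math.exp(-user_distance[v]) for v in residues)
    return num / den if den > 0 else 0.0


def quality(weights: Dict[str, float], attributes: Dict[str, float],
            residues: Dict[str, float], user_distance: Dict[str, float]) -> float:
    """q(U, X): weighted attribute sum plus the smoothed residue term."""
    return weighted_quality(weights, attributes) + smoothed_residue(residues, user_distance)
```

With weights of +2 on "stars" and -1 on "vulgarity", a four-star target object with a vulgarity score of 3 has a base quality of 5; a residue of +1 from a single user at distance 0 raises q(U, X) to 6.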

Using the Similarity Computation for Clustering

A method for defining the distance between any pair of
target objects was disclosed above. Given this distance
measure, it is simple to apply a standard clustering
algorithm, such as k-means, to group the target objects into
a number of clusters, in such a way that similar target objects
tend to be grouped in the same cluster. It is clear that the 
resulting clusters can be used to improve the efficiency of
matching buyers and sellers in the application described in
section "Matching Buyers and Sellers" above: it is not 
necessary to compare every buy profile to every sell profile, 
but only to compare buy profiles and sell profiles that are 
similar enough to appear in the same cluster. As explained 
below, the results of the clustering procedure can also be 
used to make filtering more efficient, and in the service of 
querying and browsing tasks. 

The k-means clustering method is familiar to those skilled 
in the art. Briefly put, it finds a grouping of points (target 
profiles, in this case, whose numeric coordinates are given 
by numeric decomposition of their attributes as described 
above) to minimize the distance between points in the 
clusters and the centers of the clusters in which they are 
located. This is done by alternating between assigning each 
point to the cluster which has the nearest center and then, 
once the points have been assigned, computing the (new) 
center of each cluster by averaging the coordinates of the 
points (target profiles) located in this cluster. Other cluster- 
ing methods can be used, such as "soft" or "fuzzy" k-means 
clustering, in which objects are allowed to belong to more 
than one cluster. This can be cast as a clustering problem 
similar to the k-means problem, but now the criterion being 
optimized is a little different:

$$ \sum_{C} \sum_{i} i_{iC} \, d(x_i, \bar{x}_C) $$

where C ranges over cluster numbers, i ranges over target
objects, x_i is the numeric vector corresponding to the profile
of target object number i, \bar{x}_C is the mean of all the numeric
vectors corresponding to target profiles of target objects in
cluster number C, termed the "cluster profile" of cluster C,
d(*, *) is the metric used to measure distance between two
target profiles, and i_{iC} is a value between 0 and 1 that
indicates how much target object number i is associated with
cluster number C, where i is an indicator matrix with the
property that for each i, \sum_C i_{iC} = 1. For
k-means clustering, i_{iC} is either 0 or 1.
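
The alternating procedure of ordinary ("hard") k-means described above might be sketched as follows. This is a minimal illustration under stated assumptions: target profiles are already numeric vectors, squared Euclidean distance stands in for the metric d(*, *), and the initial cluster centers are supplied by the caller.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((s - t) ** 2 for s, t in zip(a, b))


def mean(points: Sequence[Vector]) -> List[float]:
    """Coordinate-wise average, i.e. the cluster profile of the given points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points: Sequence[Vector], centers: Sequence[Vector],
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Alternate between assigning points and recomputing centers until stable."""
    centers = [list(c) for c in centers]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        # Recompute each center as the average of its assigned points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return assignment, centers
```

For four points at (0, 0), (0, 1), (10, 10) and (10, 11) with initial centers at the first and third points, the procedure settles on two clusters with centers (0, 0.5) and (10, 10.5).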

Any of these basic types of clustering might be used by 
the system: 

1) Association-based clustering, in which profiles contain 
only associative attributes, and thus distance is defined 
entirely by associations. This kind of clustering gener- 
ally (a) clusters target objects based on the similarity of 
the users who like them or (b) clusters users based on 
the similarity of the target objects they like. In this 
approach, the system does not need any information 
about target objects or users, except for their history of 
interaction with each other. 

2) Content-based clustering, in which profiles contain
only non-associative attributes. This kind of clustering
(a) clusters target objects based on the similarity of
their non-associative attributes (such as word
frequencies) or (b) clusters users based on the similarity
of their non-associative attributes (such as
demographics and psychographics). In this approach, the system
does not need to record any information about users'
historical patterns of information access, but it does
need information about the intrinsic properties of users
and/or target objects.

3) Uniform hybrid method, in which profiles may contain
both associative and non-associative attributes. This
method combines 1a and 2a, or 1b and 2b. The distance
d(P_X, P_Y) between two profiles P_X and P_Y may be
computed by the general similarity-measurement
methods described earlier.

4) Sequential hybrid method. First apply the k-means
procedure to do 1a, so that articles are labeled by
cluster based on which user read them, then use
supervised clustering (maximum likelihood discriminant
methods) using the word frequencies to do the process
of method 2a described above. This tries to use
knowledge of who read what to do a better job of clustering
based on word frequencies. One could similarly
combine the methods 1b and 2b described above.

Hierarchical clustering of target objects is often useful.
Hierarchical clustering produces a tree which divides the
target objects first into two large clusters of roughly similar
objects; each of these clusters is in turn divided into two or
more smaller clusters, which in turn are each divided into yet
smaller clusters until the collection of target objects has been
entirely divided into "clusters" consisting of a single object
each, as diagrammed in FIG. 8. In this diagram, the node d
denotes a particular target object d, or equivalently, a
single-member cluster. Target object
d is a member of the cluster (a, b, d), which is a subset of
the cluster (a, b, c, d, e, f), which in turn is a subset of all
target objects. The tree shown in FIG. 8 would be produced
from a set of target objects such as those shown
geometrically in FIG. 7. In FIG. 7, each letter represents a target
object, and axes x1 and x2 represent two of the many
numeric attributes on which the target objects differ. Such a
cluster tree may be created by hand, using human judgment
to form clusters and subclusters of similar objects, or may be
created automatically in either of two standard ways:
top-down or bottom-up. In top-down hierarchical clustering, the
set of all target objects in FIG. 7 would be divided into the
clusters (a, b, c, d, e, f) and (g, h, i, j, k). The clustering
algorithm would then be reapplied to the target objects in
each cluster, so that the cluster (g, h, i, j, k) is subpartitioned
into the clusters (g, k) and (h, i, j), and so on to arrive at the
tree shown in FIG. 8. In bottom-up hierarchical clustering,
the set of all target objects in FIG. 7 would be grouped into
numerous small clusters, namely (a, b), d, (c, f), e, (g, k), (h,
i), and j. These clusters would then themselves be grouped
into the larger clusters (a, b, d), (c, e, f), (g, k), and (h, i, j),
according to their cluster profiles. These larger clusters
would themselves be grouped into (a, b, c, d, e, f) and (g, k,
h, i, j), and so on until all target objects had been grouped
together, resulting in the tree of FIG. 8. Note that for
bottom-up clustering to work, it must be possible to apply
the clustering algorithm to a set of existing clusters. This
requires a notion of the distance between two clusters. The
method disclosed above for measuring the distance between
target objects can be applied directly, provided that clusters
are profiled in the same way as target objects. It is only
necessary to adopt the convention that a cluster's profile is
the average of the target profiles of all the target objects in
the cluster; that is, to determine the cluster's value for a
given attribute, take the mean value of that attribute across
all the target objects in the cluster. For the mean value to be
well-defined, all attributes must be numeric, so it is
necessary as usual to replace each textual or associative attribute
with its decomposition into numeric attributes (scores), as
described earlier. For example, the target profile of a single
Woody Allen film would assign "Woody-Allen" a score of 1
in the "name-of-director" field, while giving
"Federico-Fellini" and "Terence-Davies" scores of 0. A cluster that
consisted of 20 films directed by Allen and 5 directed by
Fellini would be profiled with scores of 0.8, 0.2, and 0
respectively, because, for example, 0.8 is the average of 20
ones and 5 zeros.

Searching for Target Objects

Given a target object with target profile P, or alternatively
given a search profile P, a hierarchical cluster tree of target
objects makes it possible for the system to search efficiently
for target objects with target profiles similar to P. It is only
necessary to navigate through the tree, automatically, in
search of such target profiles. The system for customized
electronic identification of desirable objects begins by
considering the largest, top-level clusters, and selects the cluster
whose profile is most similar to target profile P. In the event
of a near-tie, multiple clusters may be selected. Next, the
system considers all subclusters of the selected clusters, and
this time selects the subcluster or subclusters whose profiles
are closest to target profile P. This refinement process is
iterated until the clusters selected on a given step are
sufficiently small, and these are the desired clusters of target
objects with profiles most similar to target profile P. Any
hierarchical cluster tree therefore serves as a decision tree
for identifying target objects. In pseudo-code form, this
process is as follows (FIGS.
13A and 13B):

1. Initialize the list of identified target objects to the empty
list at step 13A00.

2. Initialize the current tree T to be the hierarchical cluster
tree of all objects at step 13A01, and at step 13A02 scan
the current cluster tree for target objects similar to P,
using the process detailed in FIG. 13B. At step 13A03,
the list of target objects is returned.

3. At step 13B00, the variable i is set to 1, and for each
child subtree T_i of the root of tree T, the cluster profile
p_i is retrieved.

4. At step 13B02, calculate d(P, p_i), the similarity distance
between P and p_i.

5. At step 13B03, if d(P, p_i) < t, a threshold, branch to one
of two options:

6. If tree T_i contains only one target object at step 13B04,
add that target object to the list of identified target objects
at step 13B05 and advance to step 13B07.

7. If tree T_i contains multiple target objects at step 13B04,
scan the ith child subtree for target objects similar to P
by invoking the steps of the process of FIG. 13B
recursively and then recurse to step 3 (step 13A01 in
FIG. 13A) with T bound for the duration of the
recursion to tree T_i, in order to search in tree T_i for target
objects with profiles similar to P.

In step 5 of this pseudo-code, smaller thresholds are 
typically used at lower levels of the tree, for example by 
making the threshold an affine function or other function of 
the cluster variance or cluster diameter of the cluster p_i. If
the cluster tree is distributed across a plurality of servers, as
described in the section of this description titled "Network
Context of the Browsing System", this process may be
executed in distributed fashion as follows: steps 3-7 are
executed by the server that stores the root node of hierar-
chical cluster tree T, and the recursion in step 7 to a
subcluster tree T_i involves the transmission of a search
request to the server that stores the root node of tree T_i,
which server carries out the recursive step upon receipt of 
this request. Steps 1-2 are carried out by the processor that 
initiates the search, and the server that executes step 6 must 
send a message identifying the target object to this initiating 
processor, which adds it to the list. 
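
The search steps above can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the ClusterNode class, the Euclidean distance function (a stand-in for the distance measure disclosed earlier), and the threshold callable are hypothetical names that do not appear in the figures. Passing the threshold as a function of the node allows smaller thresholds at lower levels of the tree, for example a function of cluster diameter.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterNode:
    profile: dict                      # cluster profile: mean attribute scores
    children: list = field(default_factory=list)
    target: Optional[str] = None       # identifier, set only on single-object leaves


def distance(p, q):
    """Euclidean distance between two numeric profiles (an assumed stand-in)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def search(tree, P, threshold, found=None):
    """Collect target objects whose enclosing clusters all lie near profile P."""
    if found is None:
        found = []                                          # step 1
    for child in tree.children:                             # step 3
        if distance(P, child.profile) < threshold(child):   # steps 4-5
            if not child.children:                          # step 6: one target object
                found.append(child.target)
            else:                                           # step 7: recurse into subtree
                search(child, P, threshold, found)
    return found


def leaf(name, x):
    """Build a single-object leaf with a one-attribute profile."""
    return ClusterNode({"x": x}, target=name)


near = ClusterNode({"x": 0.1}, [leaf("a", 0.0), leaf("b", 0.2)])
far = ClusterNode({"x": 10.0}, [leaf("c", 9.0), leaf("d", 11.0)])
root = ClusterNode({"x": 5.05}, [near, far])

# A constant threshold is used here purely for brevity.
matches = search(root, {"x": 0.1}, threshold=lambda node: 1.0)
print(matches)
```

In a distributed setting the recursive call would be replaced by a search request sent to the server holding the subtree, as described above.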

Assuming that low-level clusters have already been
formed through clustering, there are alternative search meth- 
ods for identifying the low-level cluster whose profile is 
most similar to a given target profile P. A standard back-
propagation neural net is one such method: it should be
trained to take the attributes of a target object as input, and
produce as output a unique pattern that can be used to 
identify the appropriate low-level cluster. For maximum 
accuracy, low-level clusters that are similar to each other 
(close together in the cluster tree) should be given similar 
identifying patterns. Another approach is a standard decision 
tree that considers the attributes of target profile P one at a 
time until it can identify the appropriate cluster. If profiles 
are large, this may be more rapid than considering all 
attributes. A hybrid approach to searching uses distance 
measurements as described above to navigate through the 
top few levels of the hierarchical cluster tree, until it reaches 
a cluster of intermediate size whose profile is similar to
target profile P, and then continues by using a decision tree 
specialized to search for low-level subclusters of that inter- 
mediate cluster. 

One use of these searching techniques is to search for 
target objects that match a search profile from a user's search 
profile set. This form of searching is used repeatedly in the 
news clipping service, active navigation, and Virtual Com- 
munity Service applications, described below. Another use is 
to add a new target object quickly to the cluster tree. An 
existing cluster that is similar to the new target object can be 
located rapidly, and the new target object can be added to 
this cluster. If the object is beyond a certain threshold 
distance from the cluster center, then it is advisable to start 
a new cluster. Several variants of this incremental clustering 
scheme can be used, and can be built using variants of 
subroutines available in advanced statistical packages. Note 
that various methods can be used to locate the new target
objects that must be added to the cluster tree, depending on 
the architecture used. In one method, a "webcrawler" pro- 
gram running on a central computer periodically scans all 
servers in search of new target objects, calculates the target 
profiles of these objects, and adds them to the hierarchical 
cluster tree by the above method. In another, whenever a 
new target object is added to any of the servers, a software 
"agent" at that server calculates the target profile and adds 
it to the hierarchical cluster tree by the above method. 
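
One variant of the incremental clustering scheme described above can be sketched as follows. The sketch assumes a flat list of clusters scanned linearly rather than a rapid search of the cluster tree, and the names add_incrementally and max_distance, like the Euclidean distance function, are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two numeric profiles (an assumed stand-in)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def add_incrementally(clusters, profile, max_distance):
    """Add a new target profile to the nearest cluster, or start a new one.

    Each cluster is a dict with a "center" profile and a list of "members".
    """
    nearest = min(clusters, key=lambda c: distance(profile, c["center"]),
                  default=None)
    if nearest is None or distance(profile, nearest["center"]) > max_distance:
        # Too far from every existing cluster center: start a new cluster.
        nearest = {"center": dict(profile), "members": []}
        clusters.append(nearest)
    nearest["members"].append(profile)
    # Recompute the cluster profile as the mean of its member profiles.
    count = len(nearest["members"])
    keys = set().union(*nearest["members"])
    nearest["center"] = {
        k: sum(m.get(k, 0.0) for m in nearest["members"]) / count for k in keys
    }
    return nearest


clusters = []
for new_profile in ({"x": 0.0}, {"x": 0.2}, {"x": 5.0}):
    add_incrementally(clusters, new_profile, max_distance=1.0)
print(len(clusters), clusters[0]["center"])
```

Either a central "webcrawler" or a per-server "agent" could call such a routine once it has computed the target profile of a new object.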
Rapid Profiling 

In some domains, complete profiles of target objects are 
not always easy to construct automatically. When target 
objects are multimedia, for example, an attribute such as
"genre" (a single textual term such as "Action", "Suspense/
Thriller", "Word Games", etc.) may be a matter of judgment
and opinion, difficult to determine except by consulting a 
human. More significantly, if each title has an associated 
attribute that records the positive or negative relevance 
feedback to that title from various human users (consumers) 
then all the association scores of any newly introduced title 
are initially zero so that it is initially unclear what other titles 
are similar to the new title with respect to the users who like 
them. Indeed, if this associative attribute is highly weighted, 
the initial lack of relevance feedback information may be 
difficult to remedy, due to a vicious circle in which users of 
moderate-to-high interest are needed to provide relevance 
feedback but relevance feedback is needed to identify users 
of moderate-to-high interest. 

Fortunately, however, it is often possible in principle to 
determine certain attributes of a new target object by 
extraordinary methods, including but not limited to methods 
that consult a human. For example, the system can in
principle determine the genre of a title by consulting one or
more randomly chosen individuals from a set of human
experts, while to determine the score between a new title and
a particular user it can in principle show the title to that user
and solicit relevance feedback. Since such requests
inconvenience people, however, it is important not to deter-
mine all difficult attributes this way, but only the ones that
are most important in classifying the article. "Rapid profil-
ing" is a method for selecting those numeric attributes that
are most important to determine. (Recall that all attributes
can be decomposed into numeric attributes, such as asso-
ciation scores or term scores.) First, a set of existing target
objects that already have complete or largely complete
profiles is clustered using a k-means algorithm. Next, each
of the resulting clusters is assigned a unique identifying
number, and each clustered target object is labeled with the
identifying number of its cluster. Standard methods then
allow construction of a single decision tree that can deter-
mine any target object's cluster number, with substantial
accuracy, by considering the attributes of the target object,
one at a time. Only attributes that can if necessary be
determined for any new target object are used in the con-
struction of this decision tree. To profile a new target object,
the decision tree is traversed downward from its root as far
as is desired. The root of the decision tree considers some
attribute of the target object. If the value of this attribute is
not yet known, it is determined by a method appropriate to
that attribute; for example, if the attribute is the association
score of the target object with user #4589, then relevance
feedback (to be used as the value of this attribute) is solicited
from user #4589, perhaps by the ruse of adding the possibly
uninteresting target object to a set of objects that the system
recommends to the user's attention, in order to find out what
the user thinks of it. Once the root attribute is determined,
the rapid profiling method descends the decision tree by one
level, choosing one of the decision subtrees of the root in
accordance with the determined value of the root attribute.
The root of this chosen subtree considers another attribute of
the target object, whose value is likewise determined by an
appropriate method. The process can be repeated to deter-
mine as many attributes as desired, by whatever methods are
available, although it is ordinarily stopped after a small
number of attributes, to avoid the burden of determining too
many attributes.
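
The descent through the decision tree can be sketched as follows, assuming the tree has already been built by standard methods. The Node class, the binary threshold splits, the determine callback (standing in for whatever method obtains an unknown attribute, such as soliciting relevance feedback), and the max_depth limit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    attribute: Optional[str] = None    # attribute tested here; None on a leaf
    threshold: float = 0.5             # assumed binary split on a numeric score
    low: Optional["Node"] = None       # subtree for scores below the threshold
    high: Optional["Node"] = None      # subtree for scores at or above it
    cluster_id: Optional[int] = None   # best-guess cluster number at this node


def rapid_profile(root: Node, known: dict,
                  determine: Callable[[str], float], max_depth: int):
    """Descend at most max_depth levels, determining attributes only as needed."""
    profile = dict(known)
    node, depth = root, 0
    while node.attribute is not None and depth < max_depth:
        if node.attribute not in profile:
            # The value is not yet known: obtain it by an appropriate method.
            profile[node.attribute] = determine(node.attribute)
        go_high = profile[node.attribute] >= node.threshold
        node = node.high if go_high else node.low
        depth += 1
    return profile, node.cluster_id


tree = Node(
    attribute="genre:Action", cluster_id=1,
    low=Node(cluster_id=1),
    high=Node(attribute="association:user-4589", cluster_id=2,
              low=Node(cluster_id=2), high=Node(cluster_id=3)),
)

asked = []


def determine(attribute):
    """Stand-in for consulting an expert or soliciting relevance feedback."""
    asked.append(attribute)
    return {"genre:Action": 1.0, "association:user-4589": 0.0}[attribute]


profile, cluster = rapid_profile(tree, {}, determine, max_depth=2)
print(profile, cluster, asked)
```

Stopping the descent early, by choosing a small max_depth, limits the number of attributes that must be determined by burdensome methods.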

It should be noted that the rapid profiling method can be
used to identify important attributes in any sort of profile,
and not just profiles of target objects. In particular, recall that
the disclosed method for determining topical interest
through similarity requires users as well as target objects to
have profiles. New users, like new target objects, may be
profiled or partially profiled through the rapid profiling
process. For example, when user profiles include an asso-
ciative attribute that records the user's relevance feedback
on all target objects in the system, the rapid profiling
procedure can rapidly form a rough characterization of a
new user's interests by soliciting the user's feedback on a
small number of significant target objects, and perhaps also
by determining a small number of other key attributes of the
new user, by on-line queries, telephone surveys, or other
means. Once the new user has been partially profiled in this
way, the methods disclosed above predict that the new user's
interests resemble the known interests of other users with
similar profiles. In a variation, each user's user profile is
subdivided into a set of long-term attributes, such as demo-
graphic characteristics, and a set of short-term attributes that
help to identify the user's temporary desires and emotional
state, such as the user's textual or multiple-choice answers
to questions whose answers reflect the user's mood. A subset
of the user's long-term attributes are determined when the
user first registers with the system, through the use of a rapid
profiling tree of long-term attributes. In addition, each time
the user logs on to the system, a subset of the user's
short-term attributes are additionally determined, through
the use of a separate rapid profiling tree that asks about
short-term attributes.
Market Research

A technique similar to rapid profiling is of interest in
market research (or voter research). Suppose that the target
objects are consumers. A particular attribute in each target
profile indicates whether the consumer described by that
target profile has purchased product X. A decision tree can
be built that attempts to determine what value a consumer
has for this attribute, by consideration of the other attributes
in the consumer's profile. This decision tree may be tra-
versed to determine whether additional users are likely to
purchase product X. More generally, the top few levels of
the decision tree provide information, valuable to advertisers
who are planning mass-market or direct-mail campaigns,
about the most significant characteristics of consumers of
product X.

Similar information can alternatively be extracted from a
collection of consumer profiles without recourse to a deci-
sion tree, by considering attributes one at a time, and
identifying those attributes on which product X's consumers
differ significantly from its non-consumers. These tech-
niques serve to characterize consumers of a particular prod-
uct; they can be equally well applied to voter research or
other survey research, where the objective is to characterize
those individuals from a given set of surveyed individuals
who favor a particular candidate, hold a particular opinion,
belong to a particular demographic group, or have some other
set of distinguishing attributes. Researchers may wish to
purchase batches of analyzed or unanalyzed user profiles
from which personal identifying information has been
removed. As with any statistical database, statistical conclu-
sions can be drawn, and relationships between attributes can
be elucidated using knowledge discovery techniques which
are well known in the art.

CONSUMER-BASED BETTER BUSINESS BUREAU

In the case of profiling new products, a decision tree may
be useful for determining its profile quickly (for example if
certain general attributes are known about the product).
Rapid profiling may also be used to automatically present a
selection of attributes (of at least two) with which a user
selects which attribute most aptly describes the product
and/or provides a weighted value of its relevance thereto.
Alternatively, the decision tree presents (for each node) at
least one exemplar item which the user rates indicating the
degree of similarity between the system presented item(s)
and the new item of interest. Additionally, for the sake of
optimizing the confidence of the users being surveyed, the
decision tree may also identify the user whose profiles
suggest the greatest degree of similarity with the attributes
or items being presented as queries. In one variation in this
regard, the system selects users which are most familiar with
two or more competitive products. The system performs a
rapid profiling of these users, however, for product attributes
which are most relevant to both products (which is produced
from the result of combining or averaging both product
profiles). Example attributes which are most telling about
the user's perception of comparative value and quality when
making a selection may include: performance, aesthetics,
comfort, convenience of use, value, overall satisfaction,
personal preference, as well as other relevant specific prod-
uct attributes which may be determined as a part of the
user's profile. By applying this technique over multiple
product brands within a given category, a relative, compara-
tive measure can be determined through averaging of results
across all participating users on an attribute-specific basis.
Using the techniques described above which allow for
pseudonymous credentialing of users or organizations by
other entities, these evaluation-based attributes may be
automatically ascribed to each product in the form of
credentials; manually ascribed comments or descriptions
may also be provided (and subsequently rated by other
users) to further leverage consumer participation in adding
characterization attributes to a given product's or entity's
profile. These averaged consumer-rating-based credentials
also act as a means of normalizing biased opinions or rogue
attempts to defame a product or entity and thus are used to
substantiate claims which consumers have provided and
other consumers have substantiated either in the form of
on-line or off-line advertisements and coupons. Comparative
ratings of competitive products are achievable by targeting
users who have experience with the (two or more) products
being compared. The most relevant attributes which both
products share are presented using these rapid profiling
techniques. In order to develop a truly robust, statistically
confident comparison across all products on an attribute-by-
attribute basis, it is important to use this comparative prod-
uct rating approach to identify automatically which product
comparisons are most statistically relevant, in order to pro-
vide statistical confidence for all products being evaluated
(in this comparative product context). Validation of the
values of each attribute using different combinations of
product comparisons is important in order to assure statis-
tical confidence (between different users). These rated
attribute credentials may also be segmented by user types
using knowledge discovery techniques. For example, it is
possible that users of a certain demographic, product affinity
or other attribute type may have different preferences,
demands or expectations, and thus may evaluate a product's
overall quality or value (or other product attribute) differ-
ently. Additionally, these credentials may be provided as
resolution credentials, for example in combination with a
credential provided by a neutral third party which proves
that the user is in good standing with its customers (that a
"significant" number of complaints were not submitted).
Brokerage exchanges which match buyers and sellers and/or
act as a directory thereof may wish to apply these techniques
in order to provide users with some unbiased feedback from
peers about products and services being solicited, in the form
of peer-to-peer rating-based resolution credentials. It is also
possible to automatically present a set of survey questions to
a group of users who have been previously interacting on-line
with another user. Because of the subjective nature involved in
characterizing individuals based upon their personal, or even
professional, proficiencies and weaknesses, human involve-
ment in providing manual characterizations of a sample of
users is necessary. The nature of the interaction (an
associate, professional, personal, or social) may be deter-
mined through automatic means (based on the content
profiles of dialogues and lists of "similar" users which they
interact with) in order to automatically ascribe an associative
attribute which identifies both other individuals, his/her
relationship with the user and the nature of their interaction.
Individuals may be automatically presented with targeted
questions appropriate to the nature thereof in accordance
with their mutual relationship through anticipation of which
attributes or queries other individuals (like friends,
associates, business partners or employers) are most likely
to request in the future. These questions are ideally
requested from multiple users; their values are then averaged
and may be ascribed to that user as resolution credentials. In
case of disputes, mediation by an adjudicating third party may be
required. Additionally, the system may further anticipate the
types of questions which are most likely to be requested by
other users in the future. This approach may also be used by
the system to profile skill sets, qualifications, issues of
personality, character or qualification to perform a particular
task. It may also direct queries to the users most likely to be
qualified or knowledgeable in certain popular domains, which
are most likely to be relevant (and thus anticipate the types
of queries that other users are likely to request). Similarly,
users may be used to answer questions or provide descrip-
tive characterizations of certain tasks or queries using rapid
profiling in this way as well. Thus, tasks (consulting on the
internet, intranet, etc.) may be profiled according to the types
of users who ascribe subjective or objective attributes to
best describe the task, or attributes may be ascribed which
characterize the most appropriate individuals according to
their professional qualifications or other relevant attributes,
such as the tasks which they have successfully performed.
Accordingly, task attributes may also be conveyed to the
best candidates to whom these tasks are directed. As
suggested, task performance may be manually evaluated in
order to provide the system with a source of performance-
based relevance feedback. The users who submitted the task
offers are given the opportunity to provide an evaluation of
the level of the quality of the work (or query response) as
well as overall satisfaction regarding the response to the
request offer. The requester may provide an evaluation in the
form of a set of feedback comments. Additionally, the rapid
profiling technique will automatically generate a set of the
most relevant attributes in the form of a survey which allows
the user to rate the attributes according to each relevant
attribute parameter as perceived by the user. (These
attributes may, of course, include those which are humanly
ascribed as well.) Unlike the method for automatic query
routing, the current system for finding optimal user skill
profiles to match the particular submitted task description
potentially embodies a much more complex knowledge
construction requiring precision-oriented statistical knowledge
about the nature of the user's numerous skill sets and the
submitted tasks.

It may be very useful to use associative attributes to
identify the relevant words in the task description and users
who successfully provided solutions and responses to simi-
larly described tasks in the past. According to the previously
described techniques of the patent, the collection of target
objects in this particular information domain includes task
descriptions, solutions to the requests, individuals who have
provided solutions to those tasks, individuals whose profiles
qualify them for solving particular problem types, and
individuals who are most likely to have a need for a solution
to a particular type of problem. As suggested, each of these
types of target objects may constitute the information space
of the presently described system for customized electronic
identification of desirable objects. Thus, in order to augment
the search retrieval process, the user may also be directed to
potentially useful information through menu browsing and
search query navigation (and nearest neighbor, target object
to target object navigation down or across the menu), as well
as the current matching of appropriate users with requests
herein described. Accordingly, as is relevant in the other
informational domains, if the target object profiles and the
similarity between target objects are not statistically confident,
the system will cross-correlate the statistical data from other
informational domains in order to assign the most appropri-
ate profile for each target object for which a sparse data
problem currently exists. In a more advanced embodiment,
profiling of target objects in this complex domain may be
further enhanced by establishing exceptions in the form of
special appropriateness function rules between the textual,
descriptive, and numeric attributes of those targeted objects
(e.g. the qualification of the users, the textual attributes in
the description of each task, and the evaluative description
of the recipients of the task solutions provided). As in other
informational domains, the exception rules which apply to a
particular domain are given priority over those which apply
to another domain. (Again, cross-correlation statistics
are given second priority in order to maximize statistical
confidence.) Such exception rules may include (but are not
limited to) giving special relevance between word attributes
based upon the sequence in which those textual attributes
appear in the description (or in the presence or absence of
a numeric attribute in combination with a numeric attribute
or a textual attribute). (These associations may also be based
on their relative frequencies in the text as well.) Or more
complex rules may be established automatically.
Furthermore, if the combination of words appears, and the
request is from a particular user, it is likely that a particular
detailed target profile is appropriate for the target object. By
definition, exception rules apply exceptions in the weighting
values of attributes or an attribute with an exception is
present (or at least one of) at least three attributes which are
present in a particular (user or target object) profile whose
attribute weighting influence upon another attribute would
not otherwise be recognized in a pure (non-rule based)
statistical model. (Customized) profiles of requests which are
specific to each user may be used, as each user may submit
similar requests in a different descriptive manner (with
varying word usage). The user's needs may also vary based
upon the context of what actions the user has recently
performed, e.g., searching through particular topics of the
World Wide Web, searching through e-mail, conversing with
particular users about a particular topic, or engaging in these
activities at certain times or in conjunction with any of the
above, which may indicate the context of the user's mode of
activities such as work, leisure or academics. If a particular
combination of words appears and it is from a particular
request as part of the description of a request from a
particular individual, the relevance of each attribute com-
ponent of the request may be different to some degree than
the request from a different individual (where, in this case,
these exception rules are relevant to particular users).
Accordingly, the sequence of words which appear (for a
particular word combination) may be suggestive of the
relative importance of particular words to one another or to
a particular solution or a particular individual. Accordingly,
in the application to matching queries or tasks with users
according to their qualifications, the particular combina-
tion of qualifying credentials which a user possesses may
indicate an exception rule either between particular
credentials, between credentials and individual tasks (or
between credentials and textual attributes in the text of task
descriptions). Exception rules are not applicable for asso-
ciative attributes which associate target objects, users (or
both) via the present similarity-based techniques.

SUPPORTING ARCHITECTURE

The following section describes the preferred computer
and network architecture for implementing the methods
described in this patent.
Electronic Media System Architecture

FIG. 1 illustrates in block diagram form the overall
architecture of an electronic media system, known in the art,
in which the system for customized electronic identification
of desirable objects of the present invention can be used to
provide user customized access to target objects that are
available via the electronic media system. In particular, the
electronic media system comprises a data communication
facility that interconnects a plurality of users with a number
of information servers. The users are typically individuals,
whose personal computers (terminals) T1-Tn are connected
via a data communications link, such as a modem and a
telephone connection established in well-known fashion, to
a telecommunication network N. User information access
software is resident on the user's personal computer and
serves to communicate over the data communications link
and the telecommunication network N with one of the
plurality of network vendors V1-Vk (America Online,
Prodigy, CompuServe, other private companies or even
universities) who provide data interconnection service with
selected ones of the information servers I1-Im. The user can,
by use of the user information access software, interact with
the information servers I1-Im to request and obtain access to
data that resides on mass storage systems SS1-SSm that are part
of the information server apparatus. New data is input to this
system by users via their personal computers T1-Tn and by
commercial information services by populating their mass
storage systems SS1-SSm with commercial data. Each user
terminal T1-Tn and the information servers I1-Im have
phone numbers or IP addresses on the network N which
enable a data communication link to be established between
a particular user terminal T1-Tn and the selected information
server I1-Im. A user's electronic mail address also uniquely
identifies the user and the user's network vendor V1-Vk in
an industry-standard format such as: username@aol.com or
username@netcom.com. The network vendors V1-Vk pro-
vide access passwords for their subscribers (selected users),
through which the users can access the information servers
I1-Im. The subscribers pay the network vendors V1-Vk for
the access services on a fee schedule that typically includes
a monthly subscription fee and usage based charges. A
difficulty with this system is that there are numerous infor-
mation servers I1-Im located around the world, each of
which provides access to a set of information of differing
format, content and topics and via a cataloging system that
is typically unique to the particular information server I1-Im.
The information is comprised of individual "files," which
can contain audio data, video data, graphics data, text data,
structured database data and combinations thereof. In the
terminology of this patent, each target object is associated
with a unique file: for target objects that are informational in
nature and can be digitally represented, the file directly
stores the informational content of the target object, while
for target objects that are not stored electronically, such as
purchasable goods, the file contains an identifying descrip-
tion of the target object. Target objects stored electronically
as text files can include commercially provided news
articles, published documents, letters, user-generated
documents, descriptions of physical objects, or combina-
tions of these classes of data. The organization of the files
containing the information and the native format of the data
contained in files of the same conceptual type may vary by
information server I1-Im.

Thus, a user can have difficulty in locating files that 60 
contain the desired information, because the information 
may be contained in files whose information server catalog- 
ing may not enable the user to locate them. Furthermore, 
there is no standard catalog that defines the presence and 
services provided by all information servers I A — I m . A user 65 
therefore does not have simple access to information but 
must expend a significant amount of time and energy to 
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excerpt a segment of the information that may be relevant to 
the user from the plethora of information that is generated 
and populated on this system. Even if the user commits the 
necessary resources to this task, existing information 
retrieval processes lack the accuracy and efficiency to ensure 
that the user obtains the desired information. It is obvious 
that within the constructs of this electronic media system, 
the three modules of the system for customized electronic 
identification of desirable objects can be implemented in a 
distributed manner, even with various modules being imple- 
mented on and/or by different vendors within the electronic 
media system. For example, the information servers \j-l m 
can include the target profile generation module while the 
network vendors Vj— V t may implement the user profile 
generation module, the target profile interest summary gen- 
eration module, and/or the profile processing module. A 
module can itself be implemented in a distributed manner, 
with numerous nodes being present in the network N, each 
node serving a population of users in a particular geographic 
area. The totality of these nodes comprises the functionality 
of the particular module. Various other partitions of the 
modules and their functions are possible and the examples 
provided herein represent illustrative examples and are not 
intended to limit the scope of the claimed invention. For the 
purposes of pseudonymous creation and update of users' 
target profile interest summaries (as described below), the 
vendors V1-Vk may be augmented with some number of
proxy servers, which provide a mechanism for ongoing 
pseudonymous access and profile building through the 
method described herein. At least one trusted validation 
server must be in place to administer the creation of pseud- 
onyms in the system. 

An important characteristic of this system for customized 
electronic identification of desirable objects is its 
responsiveness, since the intended use of the system is in an 
interactive mode. The system utility grows with the number
of users, and this increases the number of possible
consumer/product relationships between users and target 
objects. A system that serves a large group of users must 
maintain interactive performance and the disclosed method 
for profiling and clustering target objects and users can in 
turn be used for optimizing the distribution of data among 
the members of a virtual community and through a data 
communications network, based on users' target profile 
interest summaries. 

Network Elements and System Characteristics 

The various processors interconnected by the data com- 
munication network N as shown in FIG. 1 can be divided 
into two classes and grouped as illustrated in FIG. 2: clients 
and servers. The clients C1-Cn are individual users' computer
systems which are connected to servers S1-S5 at
various times via data communications links. Each of the
clients Ci is typically associated with a single server Sj, but
these associations can change over time. The clients C1-Cn
both interface with users and produce and retrieve files to
and from servers. The clients C1-Cn are not necessarily
continuously on-line, since they typically serve a single user
and can be movable systems, such as laptop computers, 
which can be connected to the data communications network 
N at any of a number of locations. Clients could also be a 
variety of other computers, such as computers and kiosks 
providing access to customized information as well as 
targeted advertising to many users, where the users identify 
themselves with passwords or with smart cards. A server Si 
is a computer system that is presumed to be continuously 
on-line and functions to both collect files from various 
sources on the data communication network N for access by 
local clients C1-Cn and collect files from local clients
C1-Cn for access by remote clients. The server Si is
equipped with persistent storage, such as a magnetic disk
data storage medium, and is interconnected with other
servers via data communications links. The data communi- 
cations links can be of arbitrary topology and architecture, 
and are described herein for the purpose of simplicity as 
point-to-point links or, more precisely, as virtual point-to-point
links. The servers S1-S5 comprise the network vendors
V1-Vk as well as the information servers I1-Im of FIG.
1 and the functions performed by these two classes of
modules can be merged to a greater or lesser extent in a 
single server Si or distributed over a number of servers in the 
data communication network N. Prior to proceeding with the 
description of the preferred embodiment of the invention, a 
number of terms are defined. FIG. 3 illustrates in block 
diagram form a representation of an arbitrarily selected 
network topology for a plurality of servers A-D, each of 
which is interconnected to at least one other server and 
typically also to a plurality of clients p-s. Servers A-D are 
interconnected by a collection of point to point data com- 
munications links, and server A is connected to client r, 
server B is connected to clients p-q, while server D is 
connected to client s. Servers transmit encrypted or unencrypted
messages amongst themselves: a message typically
contains the textual and/or graphic information stored in a 
particular file, and also contains data which describe the type 
and origin of this file, the name of the server that is supposed 
to receive the message, and the purpose for which the file 
contents are being transmitted. Some messages are not 
associated with any file, but are sent by one server to other 
servers for control reasons, for example to request transmis- 
sion of a file or to announce the availability of a new file. 
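
The message layout just described, together with the relay example for servers A and D given in this section, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Message` field names, the `LINKS` table, and the link costs do not come from the specification, and the least-cost search (Dijkstra's algorithm) is only one way a network might "optimize traffic routing". The links A-B, A-C, B-C and C-D are chosen so that A can reach D via C or via B, C.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Hypothetical envelope for a server-to-server message."""
    recipient: str                    # server that is supposed to receive the message
    purpose: str                      # why the contents are being transmitted
    origin: Optional[str] = None      # origin of the file, if any
    file_type: Optional[str] = None   # type of the file, if any
    body: Optional[bytes] = None      # textual and/or graphic file contents

    @property
    def is_control(self) -> bool:
        # Control messages (requests, announcements) carry no file.
        return self.body is None


# Assumed point-to-point links with illustrative costs (not from the source).
LINKS = {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "C"): 1.0, ("C", "D"): 1.0}


def neighbors(node):
    """Yield (adjacent server, link cost) pairs; links are bidirectional."""
    for (u, v), cost in LINKS.items():
        if u == node:
            yield v, cost
        elif v == node:
            yield u, cost


def cheapest_path(src, dst):
    """Return (total cost, list of servers) for the least-cost relay path."""
    heap = [(0.0, src, [src])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in neighbors(node):
            if nxt not in seen:
                heapq.heappush(heap, (cost + link_cost, nxt, path + [nxt]))
    return None  # no route between the two servers


announce = Message(recipient="D", purpose="announce new file")
print(announce.is_control, cheapest_path("A", announce.recipient))
```

With these assumed costs the relay path through B and C is cheaper than the path through C alone, so the message from A is forwarded A, B, C, D. Changing the cost on the A-C link changes which relay is preferred.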
Messages can be forwarded by a server to another server, as 
in the case where server A transmits a message to server D 
via a relay node of either server C or servers B, C. It is 
generally preferable to have multiple paths through the 
network, with each path being characterized by its perfor- 
mance capability and cost to enable the network N to 
optimize traffic routing. In one particular implementation 
which is increasingly used on the World Wide Web, "chan- 
nels" of content are used to enable users to select topically 
relevant areas of interest (e.g., National Geographic, Forbes, 
The Wall Street Journal, USA Today, The Disney Channel, 
Wired, CNN). These channels may be either accessed on 
demand, downloaded in advance to the user (as part of a 
"virtual" subscription) or selectively retrieved wherein the 
user's profile dictates the items selected. In this approach the 
items may be actively prefetched or filtered from a live chat 
stream. Similarly the current methods for the custom news 
filter may be used in this application to selectively filter and 
present the most relevant programming selections to the 
user, thus creating a "virtual channel". The basis for this 
concept (using a one way down stream delivery architecture) 
was detailed in a pending patent application.
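
As a rough illustration of how a user's profile might dictate the items selected for a "virtual channel", the sketch below scores each item against a target profile interest summary and keeps the best matches. It is a minimal sketch under stated assumptions: the function names, the smoothed logarithmic weighting, and the use of cosine similarity are illustrative choices, not the method of the specification. Only the general idea of weighting a word by its frequency in an item relative to its frequency across all items comes from this document.

```python
import math
from collections import Counter


def target_profile(text, doc_freq, n_docs):
    """Weight each word by its share of the item, discounted by how many
    items contain it (an assumed TF-IDF-style weighting)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {
        word: (count / total) * math.log((1 + n_docs) / (1 + doc_freq.get(word, 0)))
        for word, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def virtual_channel(items, interest_summary, top_k=2):
    """Return the names of the top_k items that best match the user's interests."""
    doc_freq = Counter(w for text in items.values() for w in set(text.lower().split()))
    n_docs = len(items)
    scored = [
        (cosine(target_profile(text, doc_freq, n_docs), interest_summary), name)
        for name, text in items.items()
    ]
    scored.sort(reverse=True)
    # Items with no overlap at all are dropped rather than padded in.
    return [name for score, name in scored[:top_k] if score > 0]


items = {
    "a": "stock market rally lifts bank shares",
    "b": "volcano erupts near island village",
    "c": "central bank raises interest rates",
}
interests = {"bank": 1.0, "market": 0.5}  # hypothetical interest summary
print(virtual_channel(items, interests))
```

Here item "a" ranks ahead of item "c" because it matches both weighted interest terms, while item "b" shares no terms with the summary and is left out of the channel.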

In accordance with the techniques presently suggested,
just as categories of information have profiles, the most
appropriate information (e.g., news information) can be
automatically routed to the most appropriate category.
Similarly, content may be automatically routed to the most
appropriate virtual channels, each of which appeals to a
particular type of audience based not only on its content but
on more subjective criteria as well, such as a unique
multimedia experience or the writing or commentary style of
its authors. For this reason it may be most appropriate to
first gather relevance feedback on which users access the
information, in order to develop statistical confidence as to
its associative attributes before it is routed to a particular
channel. For example, just as the presently described
techniques customize content through indexing, navigation
and delivery from the entire scope of available information
on the Internet, the scope of information may be
narrowed to that of a particular channel. Additionally,
because considerable overlap of content may occur between 
channels, authors and editors of a particular channel may use 
this technique to select the most desirable content from 
which appropriate editing and revisions may be performed 
as desired. These channels ideally are presented in combi- 
nation with virtual communities (e.g., virtual text and voice 
chat rooms). They may accordingly be navigated to/from as 
part of the 3-D representation of the surrounding information
space. For example, a virtual chat room associated with a
news channel may incorporate scheduled live interviews
with news reporters (or news makers) who had covered (or
had been involved in) a particular story or combination of
stories, during which time participants may submit questions
or comments (pseudonymously if desired). Polls may be
taken about these users' views on each particular event or
controversial issue that is newsworthy. As suggested,
preference based attributes, demographics and psychologi- 
cal user attributes may be statistically correlated with certain 
news from survey question responses or as otherwise sub- 
mitted (such as in the form of active comments about that 
particular issue). Because questions and comments from 
many users may bombard a particular chat room, automated 
methods may be used to more efficiently manage large 
quantities of data. Specifically, the system may apply the 
following techniques: 

1. Real time automatic identification of similar queries or 
comments which had been previously submitted (using 
statistical NLP or deeper NLU techniques). Once a user 
has submitted a question or comment, the system 
instantaneously indexes any similar item(s) previously 
submitted, automatically notifies the user that the user's 
submission has been canceled and automatically 
retrieves the previously submitted response to that 
previously submitted item. In the context of an ascribed 
posting to news groups currently known techniques 
such as auto-FAQ are able to generate FAQs automati- 
cally. For either live chat or (asynchronous) 
newsgroups, this technique may instead be used to 
eliminate redundancy by identifying (by indexing in 
real time via statistical NLP) pre-existing similar cor- 
respondences to those which are about to be initiated. 

2. Automatically determine the predicted value of a user's 
comments and responses. This may be determined as 
the product of number and length of comments sub- 
mitted in response to that user's postings, as well as the 
estimated predicted value of the response based upon 
the estimated value of that associated particular respon- 
dent's knowledge within the knowledge domain of the 
content profile of that response as well as the time that 
users spend reading the posting from the user's interest 
profile. Again, the relevance of this factor is also the 
product of the reader's knowledge within the knowl- 
edge domain of the content profile of the user's mes- 
sage. In the application to a future guest or moderator 
of a bulletin board or chat room (or a variation thereof 
called a "virtual talk show" in which the moderator 
fields questions by participants) the most predictively 
"valuable" questions, comments and/or responses are 
selectively prioritized for submission and reading (if a 
response) by the other participants. For the newsgroup 
application, items which are highest priority are presented
first, responses to the same which are of highest
priority are posted. Additionally, an item which is very
similar though not as "valued" may also receive a lower
value score which is less valuable though more unlike
other items. In the application to live chat because the
associative attribute of the list of readers of the item is
unavailable (in real time) instead, the real time profiling
of the message is performed and any predictive value
estimated based upon that user's determined skill
(value) within the knowledge domain of his/her message.
Additionally, value estimation may be converted
to actual price values (using the exchange of soft
currency) as a variation of the price point determination
scheme. In this regard, dialogues, users submitted
queries and anticipated responses thereto are appropriately
matched, priced (value appraised), a "net balance"
is automatically determined for each informational
exchange (or transaction) and each user's
"account" is debited or credited accordingly. If desired,
participants external to a particular transaction may
passively observe the net cost of each transaction, the
price and, if the user perceives the estimated value to be
inappropriate, he/she may submit a suggested modification
of its value. These recommendations may be
averaged in order to determine the most appropriate net
transaction value. Again the relevance may be adjusted
to the recommendation in accordance with the skill of
that user within that knowledge domain for determining
the actual modified value. This approach may be
applied also in the context of Intranet (or
multi-organizational Intranet).

Several applications to bandwidth content delivery may be
included, including video on demand wherein video and
audio programming content may be delivered to the user.
Techniques for customizing program guide selections to
users have been detailed in the pending patent entitled
"System and Method for Scheduling Broadcast and Access
to Video Program and Other Data Using Customer Profiles".
The present system may readily be applicable to radio
programming sent over cable (or the Internet). Particularly
for short programming selections like music, music video
and short audio or multimedia segments, it is desirable to
automate the selection process by creating a "virtual
channel" of selections which are retrieved sequentially. As
previously described, existing channels may be accessible to
users on the WWW. These techniques for automated
sequential retrieval of content may be another implementation of
another channel (e.g., using cable as a high bandwidth
transmission medium to access a video server on the
WWW). Another application of this architecture could be
use of a client processor in a video store which receives
purchases from the user's account, is maintained on the local
server and the similarity measurements are processed locally
or performed by a video server which may deliver high
bandwidth video, audio (e.g., music) or multimedia
software to a compact disc at the store which is customized to
the user's preferences. If user purchasing records don't yet
exist or are not complete, the rapid profiling system may
construct the user's profile. This system may be
implemented as a stand alone credit card or smart card enabled
kiosk which may be equipped with (for example) the
currently described menu navigation and query techniques.

Proxy Servers and Pseudonymous Transactions

While the method of using target profile interest summaries
presents many advantages to both target object providers
and users, there are important privacy issues for both
users and providers that must be resolved if the system is to
be used freely and without inhibition by users without fear
of invasion of privacy. It is likely that users desire that some,
if not all, of the user-specific information in their user
profiles and target profile interest summaries remain
confidential, to be disclosed only under certain
circumstances related to certain types of transactions and according
to their personal wishes for differing levels of confidentiality
regarding their purchases and expressed interests.

However, complete privacy and inaccessibility of user
transactions and profile summary information would hinder
implementation of the system for customized electronic
identification of desirable objects and would deprive the user
of many of the advantages derived through the system's use
of user-specific information. In many cases, complete and
total privacy is not desired by all parties to a transaction. For
example, a buyer may desire to be targeted for certain
mailings that describe products that are related to his or her
interests, and a seller may desire to target users who are
predicted to be interested in the goods and services that the
seller provides. Indeed, the usefulness of the technology
described herein is contingent upon the ability of the system
to collect and compare data about many users and many
target objects. A compromise between total user anonymity
and total public disclosure of the user's search profiles or
target profile interest summary is a pseudonym. A
pseudonym is an artifact that allows a service provider to
communicate with users and build and accumulate records of
their preferences over time, while at the same time
remaining ignorant of the users' true identities, so that users can
keep their purchases or preferences private. A second and
equally important requirement of a pseudonym system is
that it provide for digital credentials, which are used to
guarantee that the user represented by a particular
pseudonym has certain properties. These credentials may be
granted on the basis of result of activities and transactions
conducted by means of the system for customized electronic
identification of desirable objects, or on the basis of other
activities and transactions conducted on the network N of
the present system, or on the basis of users' activities outside
of network N. For example, a service provider may require
proof that the purchaser has sufficient funds on deposit at
his/her bank, which might possibly not be on a network,
before agreeing to transact business with that user. The user
therefore, must provide the service provider with proof of
funds (a credential) from the bank, while still not disclosing
the user's true identity to the service provider.

Our method solves the above problems by combining the
pseudonym granting and credential transfer methods taught
by D. Chaum and J. H. Evertse, in the paper titled "A secure
and privacy-protecting protocol for transmitting personal
information between organizations," with the
implementation of a set of one or more proxy servers distributed
throughout the network N. Each proxy server, for example
S2 in FIG. 2, is a server which communicates with clients
and other servers S5 in the network either directly or through
anonymizing mix paths as detailed in the paper by D. Chaum
titled "Untraceable Electronic Mail, Return Addresses and
Digital Pseudonyms," published in Communications of the
ACM, Volume 24, Number 2, February 1981. Any server in
the network N may be configured to act as a proxy server in
addition to its other functions. Each proxy server provides
service to a set of users, which set is termed the "user base"
of that proxy server. A given proxy server provides three
sorts of service to each user U in its user base, as follows:

1. The first function of the proxy server is to bidirectionally
transfer communications between user U and other
entities such as information servers (possibly including
the proxy server itself) and/or other users. Specifically,
letting S denote the server that is directly associated
with user U's client processor, the proxy server
communicates with server S (and thence with user U),
either through anonymizing mix paths that obscure the
identity of server S and user U, in which case the proxy
server knows user U only through a secure pseudonym,
or else through a conventional virtual point-to-point
connection, in which case the proxy server knows user
U by user U's address at server S, which address may
be regarded as a non-secure pseudonym for user U.

2. A second function of the proxy server is to record
user-specific information associated with user U. This
user-specific information includes a user profile and
target profile interest summary for user U, as well as a
list of access control instructions specified by user U, as
described below, and a set of one-time return addresses
provided by user U that can be used to send messages
to user U without knowing user U's true identity. All of
this user-specific information is stored in a database
that is keyed by user U's pseudonym (whether secure
or non-secure) on the proxy server.

3. A third function of the proxy server is to act as a
selective forwarding agent for unsolicited communications
that are addressed to user U: the proxy server
forwards some such communications to user U and
rejects others, in accordance with the access control
instructions specified by user U.

Our combined method allows a given user to use either a
single pseudonym in all transactions where he or she wishes
to remain pseudonymous, or else different pseudonyms for
different types of transactions. In the latter case, each service
provider might transact with the user under a different
pseudonym for the user. More generally, a coalition of
service providers, all of whom match users with the same
genre of target objects, might agree to transact with the user
using a common pseudonym, so that the target profile
interest summary associated with that pseudonym would be
complete with respect to said genre of target objects. When
a user employs several pseudonyms in order to transact with
different coalitions of service providers, the user may freely
choose a proxy server to service each pseudonym; these
proxy servers may be the same or different.

From the service provider's perspective, our system
provides security, in that it can guarantee that users of a service
are legitimately entitled to the services used and that no user
is using multiple pseudonyms to communicate with the same
provider. This uniqueness of pseudonyms is important for
the purposes of this application, since the transaction
information gathered for a given individual must represent a
complete and consistent picture of a single user's activities
with respect to a given service provider or coalition of
service providers; otherwise, a user's target profile interest
summary and user profile would not be able to represent the
user's interests to other parties as completely and accurately
as possible.

The service provider must have a means of protection
from users who violate previously agreed upon terms of
service. For example, if a user that uses a given pseudonym
engages in activities that violate the terms of service, then
the service provider should be able to take action against the
user, such as denying the user service and blacklisting the
user from transactions with other parties that the user might
be tempted to defraud. This type of situation might occur
when a user employs a service provider for illegal activities
or defaults in payments to the service provider. The method
of the paper titled "Security without identification:
Transaction systems to make Big-Brother obsolete", published in
the Communications of the ACM, 28(10), October 1985;
pp. 1030-1044, incorporated herein, provides for a
mechanism to enforce protection against this type of behavior
through the use of resolution credentials, which are
credentials that are periodically provided to individuals contingent
upon their behaving consistent with the agreed upon terms
of service between the user and information provider and
network vendor entities (such as regular payment for
services rendered, civil conduct, etc.). For the user's safety, if
the issuer of a resolution credential refuses to grant this
resolution credential to the user, then the refusal may be
appealed to an adjudicating third party. The integrity of the
user profiles and target profile interest summaries stored on
proxy servers is important: if a seller relies on such
user-specific information to deliver promotional offers or other
material to a particular class of users, but not to other users,
then the user-specific information must be accurate and
untampered with in any way. The user may likewise wish to
ensure that other parties not tamper with the user's user
profile and target profile interest summary, since such
modification could degrade the system's ability to match the user
with the most appropriate target objects. This is done by
providing for the user to apply digital signatures to the
control messages sent by the user to the proxy server. Each
pseudonym is paired with a public cryptographic key and a
private cryptographic key, where the private key is known
only to the user who holds that pseudonym; when the user
sends a control message to a proxy server under a given
pseudonym, the proxy server uses the pseudonym's public
key to verify that the message has been digitally signed by
someone who knows the pseudonym's private key. This
prevents other parties from masquerading as the user.

Our approach, as disclosed in this application, provides an
improvement over the prior art in privacy-protected
pseudonymy for network subscribers such as taught in U.S. Pat.
No. 5,245,656, which provides for a name translator station
to act as an intermediary between a service provider and the
user. However, while U.S. Pat. No. 5,245,656 provides that
the information transmitted between the end user U and the
service provider be doubly encrypted, the fact that a
relationship exists between user U and the service provider is
known to the name translator, and this fact could be used to
compromise user U, for example if the service provider
specializes in the provision of content that is not deemed
acceptable by user U's peers. The method of U.S. Pat. No.
5,245,656 also omits a method for the convenient updating
of pseudonymous user profile information, such as is
provided in this application, and does not provide for assurance
of unique and credentialed registration of pseudonyms from
a credentialing agent as is also provided in this application,
and does not provide a means of access control to the user
based on profile information and conditional access as will
be subsequently described. The method described by Loeb et
al. also does not describe any provision for credentials, such
as might be used for authenticating a user's right to access
particular target objects, such as target objects that are
intended to be available only upon payment of a subscription
fee, or target objects that are intended to be unavailable to
younger users.

Proxy Server Description

In order that a user may ensure that some or all of the
information in the user's user profile and target profile
interest summary remain dissociated from the user's true
identity, the user employs as an intermediary any one of a
number of proxy servers available on the data communication
network N of FIG. 2 (for example, server S2). The
proxy servers function to disguise the true identity of the
user from other parties on the data communication network
N. The proxy server represents a given user to either single
network vendors and information servers or coalitions
thereof. A proxy server, e.g. S2, is a server computer with
CPU, main memory, secondary disk storage and network
communication function and with a database function which
retrieves the target profile interest summary and access
control instructions associated with a particular pseudonym
P, which represents a particular user U, and performs
bi-directional routing of commands, target objects and
billing information between the user at a given client (e.g. C3)
and other network entities such as network vendors V1-Vk
and information servers I1-Im. Each proxy server maintains
an encrypted target profile interest summary associated with
each allocated pseudonym in its pseudonym database D. The
actual user-specific information and the associated
pseudonyms need not be stored locally on the proxy server, but
may alternatively be stored in a distributed fashion and be
remotely addressable from the proxy server via
point-to-point connections.

The proxy server supports two types of bi-directional
connections: point-to-point connections and pseudonymous
connections through mix paths, as taught by D. Chaum in the
paper titled "Untraceable Electronic Mail, Return
Addresses, and Digital Pseudonyms", Communications of
the ACM, Volume 24, Number 2, February 1981. The
normal connections between the proxy server and
information servers, for example a connection between proxy server
S2 and information server S4 in FIG. 2, are accomplished
through the point-to-point connection protocols provided by
network N as described in the "Electronic Media System
Architecture" section of this application. The normal type of
point-to-point connections may be used between S2-S4, for
example, since the dissociation of the user and the
pseudonym need only occur between the client C3 and the proxy
server S2, where the pseudonym used by the user is
available. Knowing that an information provider such as S4
communicates with a given pseudonym P on proxy server S2
does not compromise the true identity of user U. The
bidirectional connection between the user and the proxy
server S2 can also be a normal point-to-point connection, but
it may instead be made anonymous and secure, if the user
desires, through the consistent use of an anonymizing mix
protocol as taught by D. Chaum in the paper titled
"Untraceable Electronic Mail, Return Addresses, and Digital
Pseudonyms", Communications of the ACM, Volume 24,
Number 2, February 1981. This mix procedure provides
untraceable secure anonymous mail between two parties with
blind return addresses through a set of forwarding and return
routing servers termed "mixes". The mix routing protocol,
as taught in the Chaum paper, is used with the proxy server
S2 to provide a registry of persistent secure pseudonyms that
can be employed by users other than user U, by information
providers I1-Im, by vendors V1-Vk and by other proxy
servers to communicate with the users in the proxy server's
user base on a continuing basis. The security provided by
this mix path protocol is distributed and resistant to traffic
analysis attacks and other known forms of analysis which
may be used by malicious parties to try and ascertain the true
identity of a pseudonym bearer. Breaking the protocol
requires a large number of parties to maliciously collude or
be cryptographically compromised. In addition an extension
to the method is taught where the user can include a return
path definition in the message so the information server S4
can return the requested information to the user's client
processor C3. We utilize this feature in a novel fashion to
provide for access and reachability control under user and
proxy server control.

Validation and Allocation of a Unique Pseudonym

Chaum's pseudonym and credential issuance system, as
described in a publication by D. Chaum and J. H. Evertse,
titled "A secure and privacy-protecting protocol for
transmitting personal information between organizations," has
several desirable properties for use as a component in our
system. The system allows for individuals to use different
pseudonyms with different organizations (such as banks and
coalitions of service providers). The organizations which are
presented with a pseudonym have no more information
about the individual than the pseudonym itself and a record
of previous transactions carried out under that pseudonym.
Additionally, credentials, which represent facts about a
pseudonym that an organization is willing to certify, can be
granted to a particular pseudonym, and transferred to other
pseudonyms that the same user employs. For example, the
user can use different pseudonyms with different
organizations (or disjoint sets of organizations), yet still present
credentials that were granted by one organization, under one
pseudonym, in order to transact with another organization
under another pseudonym, without revealing that the two
pseudonyms correspond to the same user. Credentials may
be granted to provide assurances regarding the pseudonym
bearer's age, financial status, legal status, and the like. For
example, credentials signifying "legal adult" may be issued
to a pseudonym based on information known about the
corresponding user by the given issuing organization. Then,
when the credential is transferred to another pseudonym that
represents the user to another disjoint organization,
presentation of this credential on the other pseudonym can be taken
as proof of legal adulthood, which might satisfy a condition
of terms of service. Credential-issuing organizations may
also certify particular facts about a user's demographic
profile or target profile interest summary, for example by
granting a credential that asserts "the bearer of this
pseudonym is either well-read or is middle-aged and works for a
large company"; by presenting this credential to another
entity, the user can prove eligibility for (say) a discount
without revealing the user's personal data to that entity.
Additionally, the method taught by Chaum provides for
assurances that no individual may correspond with a given
organization or coalition of organizations using more than
one pseudonym; that credentials may not be feasibly forged
by the user; and that credentials may not be transferred from
one user's pseudonym to a different user's pseudonym.
Finally, the method provides for expiration of credentials
and for the issuance of "black marks" against individuals
who do not act according to the terms of service that they are
extended. This is done through the resolution credential
mechanism as described in Chaum's work, in which
resolutions are issued periodically by organizations to
pseudonyms that are in good standing. If a user is not issued this
resolution credential by a particular organization or coalition
of organizations, then this user cannot have it available to be
transferred to other pseudonyms which he uses with other
organizations. Therefore, the user cannot convince these
other organizations that he has acted in accordance with terms
of service in other dealings. If this is the case, then the
organization can use this lack of resolution credential to
infer that the user is not in good standing in his other
dealings. In one approach organizations (or other users) may
issue a list of quality related credentials based upon the
experience of transaction (or interaction) with the user
which may act similarly to a letter of recommendation as in
a resume. If such a credential is issued from multiple
organizations, their values become averaged. In an alternative
variation organizations may be issued credentials from
users such as customers which may be used to indicate to
other future users quality of service which can be expected
by subsequent users on the basis of various criteria. In one
approach, the system automatically generates the primary
attributes contained in the profile of the user or organization.
Each attribute is then appropriately rated in order to become
a list of quality related credentials.

In our implementation, a pseudonym is a data record
consisting of two fields. The first field specifies the address
of the proxy server at which the pseudonym is registered.
The second field contains a unique string of bits (e.g., a
random binary number) that is associated with a particular
user; credentials take the form of public-key digital signatures
computed on this number, and the number itself is
issued by a pseudonym administering server Z, as depicted
in FIG. 2, and detailed in a generic form in the paper by D.
Chaum and J. H. Evertse, titled "A secure and privacy-
protecting protocol for transmitting personal information
between organizations.". It is possible to send information to
the user holding a given pseudonym, by enveloping the
information in a control message that specifies the pseudonym
and is addressed to the proxy server that is named in
the first field of the pseudonym; the proxy server may
forward the information to the user upon receipt of the
control message.

While the user may use a single pseudonym for all
transactions, in the more general case a user has a set of
several pseudonyms, each of which represents the user in his
or her interactions with a single provider or coalition of
service providers. Each pseudonym in the pseudonym set is
designated for transactions with a different coalition of
related service providers, and the pseudonyms used with one
provider or coalition of providers cannot be linked to the
pseudonyms used with other disjoint coalitions of providers.
All of the user's transactions with a given coalition can be
linked by virtue of the fact that they are conducted under the
same pseudonym, and therefore can be combined to define
a unified picture, in the form of a user profile and a target
profile interest summary, of the user's interests vis-a-vis the
service or services provided by said coalition. There are
other circumstances for which the use of a pseudonym may
be useful and the present description is in no way intended
to limit the scope of the claimed invention; for example, the
previously described rapid profiling tree could be used to
pseudonymously acquire information about the user which
is considered by the user to be sensitive such as that
information which is of interest to such entities as insurance
companies, medical specialists, family counselors or dating
services.

Detailed Protocol

In our system, the organizations that the user U interacts
with are the servers S1-Sn on the network N. However,
rather than directly corresponding with each server, the user
employs a proxy server, e.g. S2, as an intermediary between
the local server of the user's own client and the information
provider or network vendor. Mix paths as described by D.
Chaum in the paper titled "Untraceable Electronic Mail,
Return Addresses, and Digital Pseudonyms", Communications
of the ACM, Volume 24, Number 2, February 1981
allow for untraceability and security between the client, such
as C3, and the proxy server, e.g. S2. Let S(M,K) represent
the digital signing of message M by modular exponentiation
with key K as detailed in a paper by Rivest, R. L., Shamir,
A., and Adleman, L. titled "A method for obtaining digital
signatures and public-key cryptosystems", published in the
Comm. ACM 21, February 2, 120-126. Once a user applies
to server Z for a pseudonym P and is granted a signed
pseudonym signed with the private key SKZ of server Z, the
following protocol takes place to establish an entry for the
user U in the proxy server S2's database D:

1. The user now sends proxy server S2 the pseudonym, which
has been signed by Z to indicate the authenticity and
uniqueness of the pseudonym. The user also generates a
PKp, SKp key pair for use with the granted pseudonym,
where SKp is the private key associated with the pseudonym
and PKp is the public key associated with the pseudonym.
The user forms a request to establish pseudonym P on proxy
server S2, by sending the signed pseudonym S(P, SKZ) to
the proxy server S2 along with a request to create a new
database entry, indexed by P, and the public key PKp. It
envelops the message and transmits it to a proxy server S2
through an anonymizing mix path, along with an anonymous
return envelope header.

2. The proxy server S2 receives the database creation entry
request and associated certified pseudonym message. The
proxy server S2 checks to ensure that the requested
pseudonym P is signed by server Z and if so grants the
request and creates a database entry for the pseudonym, as
well as storing the user's public key PKp to ensure that only
the user U can make requests in the future using pseudonym P.

3. The structure of the user's database entry consists of a
user profile as detailed herein, a target profile interest
summary as detailed herein, and a Boolean combination of
access control criteria as detailed below, along with the
associated public key for the pseudonym P.

4. At any time after the database entry for pseudonym P is
established, the user U may provide proxy server S2 with
credentials on that pseudonym, provided by third parties,
which credentials make certain assertions about that
pseudonym. The proxy server may verify those credentials
and make appropriate modifications to the user's profile as
required by these credentials, such as recording the user's
new demographic status as an adult. It may also store those
credentials, so that it can present them to service providers
on the user's behalf.

The above steps may be repeated, with either the same or
a different proxy server, each time user U requires a new
pseudonym for use with a new and disjoint coalition of
providers. In practice there is an extremely small probability
that a given pseudonym may have already been allocated,
due to the random nature of the pseudonym generation
process carried out by Z. If this highly unlikely event occurs,
then the proxy server S2 may reply to the user with a signed
message indicating that the generated pseudonym has
already been allocated, and asking for a new pseudonym to
be generated.

Pseudonymous Control of an Information Server

Once a proxy server S2 has authenticated and registered
a user's pseudonym, the user may begin to use the services
of the proxy server S2, in interacting with other network
entities such as service providers, as exemplified by server
S4 in FIG. 2, an information service provider node
connected to the network. The user controls the proxy server S2
by forming digitally encoded requests that the user
subsequently transmits to the proxy server S2 over the network N.
The nature and format of these requests will vary, since the
proxy server may be used for any of the services described
in this application, such as the browsing, querying, and other
navigational functions described below.

In a generic scenario, the user wishes to communicate
under pseudonym P with a particular information provider or
user at address A, where P is a pseudonym allocated to the
user and A is either a public network address at a server such
as S4, or another pseudonym that is registered on a proxy
server such as S4. (In the most common version of this
scenario, address A is the address of an information provider,
and the user is requesting that the information provider send
target objects of interest.) The user must form a request R to
proxy server S2, that requests proxy server S2 to send a
message to address A and to forward the response back to the 
user. The user may thereby communicate with other parties, 
either non-pseudonymous parties, in the case where address 
A is a public network address, or pseudonymous parties, in 
the case where address A is a pseudonym held by, for 
example, a business or another user who prefers to operate 
pseudonymously. 

In other scenarios, the request R to proxy server S2 
formed by the user may have different content. For example, 
request R may instruct proxy server S2 to use the methods 
described later in this description to retrieve from the most
convenient server a particular piece of information that has
been multicast to many servers, and to send this information
to the user. Conversely, request R may instruct proxy server
S2 to multicast to many servers a file associated with a new
target object provided by the user, as described below. If the
user is a subscriber to the news clipping service described
below, request R may instruct proxy server S2 to forward to
the user all target objects that the news clipping service has
sent to proxy server S2 for the user's attention. If the user is
employing the active navigation service described below,
request R may instruct proxy server S2 to select a particular
cluster from the hierarchical cluster tree and provide a menu
of its subclusters to the user, or to activate a query that
temporarily affects proxy server S2's record of the user's
target profile interest summary. If the user is a member of a
virtual community as described below, request R may 
instruct proxy server S2 to forward to the user all messages 
that have been sent to the virtual community. 

Regardless of the content of request R, the user, at client
C3, initiates a connection to the user's local server S1, and
instructs server S1 to send the request R along a secure mix
path to the proxy server S2, initiating the following sequence
of actions:

1. The user's client processor C3 forms a signed message
S(R, SKp), which is paired with the user's pseudonym
P and (if the request R requires a response) a secure
one-time set of return envelopes, to form a message M.
It protects the message M with a multiply enveloped
route for the outgoing path. The enveloped routes
provide for secure communication between S1 and the
proxy server S2. The message M is enveloped in the
most deeply nested message and is therefore difficult to
recover should the message be intercepted by an
eavesdropper.

2. The message M is sent by client C3 to its local server
S1, and is then routed by the data communication
network N from server S1 through a set of mixes as
dictated by the outgoing envelope set and arrives at the
selected proxy server S2.

3. The proxy server S2 separates the received message M
into the request message R, the pseudonym P, and (if
included) the set of envelopes for the return path. The
proxy server S2 uses pseudonym P to index and retrieve
the corresponding record in proxy server S2's database,
which record is stored in local storage at the proxy
server S2 or on other distributed storage media
accessible to proxy server S2 via the network N. This record
contains a public key PKp, user-specific information,
and credentials associated with pseudonym P. The
proxy server S2 uses the public key PKp to check that
the signed version S(R, SKp) of request message R is
valid.
4. Provided that the signature on request message R is
valid, the proxy server S2 acts on the request R. For
example, in the generic scenario described above,
request message R includes an embedded message M1
and an address A to whom message M1 should be sent;
in this case, proxy server S2 sends message M1 to the
server named in address A, such as server S4. The
communication is done using signed and optionally
encrypted messages over the normal point-to-point
connections provided by the data communication
network N. When necessary in order to act on embedded
message M1, server S4 may exchange or be caused to
exchange further signed and optionally encrypted
messages with proxy server S2, still over normal point to
point connections, in order to negotiate the release of 
user-specific information and credentials from proxy 
server S2. In particular, server S4 may require server S2 
to supply credentials proving that the user is entitled to 
the information requested — for example, proving that 
the user is a subscriber in good standing to a particular 
information service, that the user is old enough to 
legally receive adult material, and that the user has been 
offered a particular discount (by means of a special 
discount credential issued to the user's pseudonym). 
Such a special discount credential may be automatically
provided by a trusted process residing in the
proxy server, i.e., the price point algorithm. In one
approach, this special discount credential may persist
for as long as the trusted process on the proxy server
allows, giving that user access to an appropriate
discount; this may be termed a "digital coupon". In
another variation, the terms of the special
discount credential may vary in accordance with certain 
user actions (which are pre-specified to the user) e.g. 
automatically modifying the degree or nature of the 
discount in response to user purchasing behavior 
towards that vendor or product (or jointly marketed 
products or a vendor consortium). This may be termed 
a "digital shopper's card". 

5. If proxy server S2 has sent a message to a server S4 and
server S4 has created a response M2 to message M1 to
be sent to the user, then server S4 transmits the
response M2 to the proxy server S2 using normal
network point-to-point connections.

6. The proxy server S2, upon receipt of the response M2, 
creates a return message Mr comprising the response 
M2 embedded in the return envelope set that was 
earlier transmitted to proxy server S2 by the user in the 
original message M. It transmits the return message Mr 
along the pseudonymous mix path specified by this 
return envelope set, so that the response M2 reaches the 
user at the user's client processor C3. 

7. The response M2 may contain a request for electronic 
payment to the information server S4. The user may 
then respond by means of a message M3 transmitted by 
the same means as described for message M1 above,
which message M3 encloses some form of anonymous 
payment. Alternatively, the proxy server may respond 
automatically with such a payment, which is debited 
from an account maintained by the proxy server for this 
user. 

8. Either the response message M2 from the information 
server S4 to the user, or a subsequent message sent by 
the proxy server S2 to the user, may contain advertising 
material that is related to the user's request and/or is 
targeted to the user. Typically, if the user has just 
retrieved a target object X, then (a) either proxy server 
S2 or information server S4 determines a weighted set
of advertisements that are "associated with" target
object X, (b) a subset of this set is chosen randomly,
where the weight of an advertisement is proportional to
the probability that it is included in the subset, and (c)
proxy server S2 selects from this subset just those
advertisements that the user is most likely to be
interested in. In the variation where proxy server S2
determines the set of advertisements associated with target
object X, then this set typically consists of all
advertisements that the proxy server's owner has been paid
to disseminate and whose target profiles are within a
threshold similarity distance of the target profile of
target object X. In the variation where S4
determines the set of advertisements associated with
target object X, advertisers typically purchase the right
to include advertisements in this set. In either case, the
weight of an advertisement is determined by the
amount that an advertiser is willing to pay. Following
step (c), proxy server S2 retrieves the selected
advertising material and transmits it to the user's client
processor C3, where it will be displayed to the user,
within a specified length of time after it is received, by
a trusted process running on the user's client processor
C3. When proxy server S2 transmits an advertisement,
it sends a message to the advertiser, indicating that the
advertisement has been transmitted to a user with a
particular predicted level of interest. The message may
also indicate the identity of target object X. In return,
the advertiser may transmit an electronic payment to
proxy server S2; proxy server S2 retains a service fee
for itself, optionally forwards a service fee to
information server S4 and the balance is forwarded to the user
or used to credit the user's account on the proxy server.

relevance feedback information, digitally signed by
client processor C3 with a proprietary private key
SK~, is periodically transmitted through a secure
mix path to the proxy server S2, whereupon the search
profile generation module 202 resident on server S2
updates the appropriate target profile interest summary
associated with pseudonym P, provided that the
signature on the summary message can be authenticated with
the corresponding public key PK^ which is available
to all tabulating processes that are ensured to have
integrity.

When a consumer enters into a financial relationship with
a particular information server based on both parties
agreeing to terms for the relationship, a particular pseudonym
may be extended for the consumer with respect to the given
provider as detailed in the previous section. When entering
into such a relationship, the consumer and the service
provider agree to certain terms. However, if the user violates
the terms of this relationship, the service provider may
decline to provide service to the pseudonym under which it
transacts with the user. In addition, the service provider has
the recourse of refusing to provide resolution credentials to
the pseudonym, and may choose to do so until the
pseudonym bearer returns to good standing.

Pre-Fetching of Target Objects

In some circumstances, a user may request access in
sequence to many files, which are stored on one or more
information servers. This behavior is common when
navigating a hypertext system such as the World Wide Web, or
when using the target object browsing system described
below. In general, the user requests access to a particular target
object or menu of target objects; once the corresponding file
has been transmitted to the user's client processor, the user
views its contents and makes another such request, and so
on. Each request may take many seconds to satisfy, due to
retrieval and transmission delays. However, to the extent
that the sequence of requests is predictable, the system for
customized electronic identification of desirable objects can
respond more quickly to each request, by retrieving or
starting to retrieve the appropriate files even before the user
requests them. This early retrieval is termed "pre-fetching of
files." [...] earlier, the present system also enables
[...] automatically ranked hyperlinks in accordance
[...] user [...]. By combining
this approach with prefetching [...]
files prefetching has already been initiated) overall
prediction of the next user action is further enhanced.

Pre-fetching of locally stored data has been heavily
studied in memory hierarchies, including CPU caches and
secondary storage (disks), for several decades. A leader in this
area has been A. J. Smith of Berkeley, who identified a
variety of schemes and analyzed opportunities using
extensive traces in both databases and CPU caches. His
conclusion was that general schemes only really paid off where
there was some reasonable chance that sequential access was
occurring, e.g., in a sequential read of data. As the balances
between various latencies in the memory hierarchy shifted
during the late 1980's and early 1990's, J. M. Smith and
others identified further opportunities for pre-fetching of
both locally stored data and network data. In particular,
[...] work by Blaha showed the
[...] systems for deep pattern analysis
[...] pre-fetching. Work by J. M. Smith
[...] work addressed the case of data on the [...]
where the large size of images and the long latencies provide
extra incentive to pre-fetch; Touch's technique is to pre-send
when large bandwidths permit some speculation using
HTML storage references embedded in WEB pages, and the
Berkeley work uses techniques similar to J. M. Smith's
reference histories specialized to the semantics of HTML
data.

Successful pre-fetching depends on the ability of the
system to predict the next action or actions of the user. In the
context of the system for customized electronic
identification of desirable objects, it is possible to cluster users into
groups according to the similarity of their user profiles. Any
of the well-known pre-fetching methods that collect and
utilize aggregate statistics on past user behavior, in order to
predict future user behavior, may then be implemented so
as to collect and utilize a separate set of statistics for each
cluster of users. In this way, the system generalizes its access
pattern statistics from each user to similar users, without
generalizing among users who have substantially different
interests. The system may further collect and utilize a similar
set of statistics that describes the aggregate behavior of all
users; in cases where the system cannot confidently make a
prediction as to what a particular user will do, because the
relevant statistics concerning that user's user cluster are
derived from only a small amount of data, the system may
instead make its predictions based on the aggregate statistics
for all users, which are derived from a larger amount of data.

For the sake of concreteness, we now describe a particular
instantiation of a pre-fetching system, that both employs
these insights and that makes its pre-fetching decisions
through accurate measurement of the expected cost and
benefit of each potential pre-fetch.

Pre-fetching exhibits a cost-benefit tradeoff. Let t denote
the approximate number of minutes that pre-fetched files are
retained in local storage (before they are deleted to make
room for other pre-fetched files). If the system elects to
pre-fetch a file corresponding to a target object X, then the
user benefits from a fast response at no extra cost, provided
that the user explicitly requests target object X soon
thereafter. However, if the user does not request target object X
within t minutes of the pre-fetch, then the pre-fetch was
worthless, and its cost is an added cost that must be borne
(directly or indirectly) by the user. The first scenario
therefore provides benefit at no cost, while the second scenario
incurs a cost at no benefit. The system tries to favor the first
scenario by pre-fetching only those files that the user will
access anyway. Depending on the user's wishes, the system
may pre-fetch either conservatively, where it controls costs
by pre-fetching only files that the user is extremely likely to
request explicitly (and that are relatively cheap to retrieve),
or more aggressively, where it also pre-fetches files that the
user is only moderately likely to request explicitly, thereby
increasing both the total cost and (to a lesser degree) the total
benefit to the user.

In the system described herein, pre-fetching for a user U
is accomplished by the user's proxy server S. Whenever
proxy server S retrieves a user-requested file F from an
information server, it uses the identity of this file F and the
characteristics of the user, as described below, to identify a
group of other files G1 ... Gk that the user is likely to access
soon. The user's request for file F is said to "trigger" files
G1 ... Gk. Proxy server S pre-fetches each of these triggered
files Gi as follows:

1. Unless file Gi is already stored locally (e.g., due to a
previous pre-fetch), proxy server S retrieves file Gi
from an appropriate information server and stores it
locally.

2. Proxy server S timestamps its local copy of file Gi as
having just been pre-fetched, so that file Gi will be
retained in local storage for a minimum of
approximately t minutes before being deleted.
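
The two steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the class name PrefetchCache, the fetch callable (standing in for retrieval from an information server), and the injectable clock are hypothetical and do not appear in the description, and eviction is shown as a simple sweep performed on lookup.

```python
import time


class PrefetchCache:
    """Local store for pre-fetched files, each kept for at least about t minutes."""

    def __init__(self, retention_minutes, fetch, clock=time.time):
        self.retention_seconds = retention_minutes * 60
        self.fetch = fetch    # hypothetical callable: file id -> contents
        self.clock = clock    # injectable so the retention window can be tested
        self.files = {}       # file id -> (contents, pre-fetch timestamp)

    def prefetch(self, file_id):
        # Step 1: retrieve the file only if no local copy exists.
        if file_id in self.files:
            contents = self.files[file_id][0]
        else:
            contents = self.fetch(file_id)
        # Step 2: timestamp the local copy as having just been pre-fetched.
        self.files[file_id] = (contents, self.clock())

    def evict_expired(self):
        # Delete copies whose retention window of t minutes has elapsed.
        now = self.clock()
        expired = [f for f, (_, stamp) in self.files.items()
                   if now - stamp >= self.retention_seconds]
        for f in expired:
            del self.files[f]

    def get(self, file_id):
        # Serve from local storage when possible, otherwise fetch remotely.
        self.evict_expired()
        if file_id in self.files:
            return self.files[file_id][0]
        return self.fetch(file_id)
```

A later request for a pre-fetched file is then served from the dictionary without contacting the information server, as long as the copy has not yet been deleted.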

Whenever user U (or, in principle, any other user registered 
with proxy server S) requests proxy server S to retrieve a file 
that has been pre-fetched and not yet deleted, proxy server 
S can then retrieve the file from local storage rather than 
from another server. In a variation on steps 1-2 above, proxy 
server S pre-fetches a file Gi somewhat differently, so that 
pre-fetched files are stored on the user's client processor q 
rather than on server S: 

1. If proxy server S has not pre-fetched file Gi in the past
t minutes, it retrieves file Gi and transmits it to user
U's client processor q.

2. Upon receipt of the message sent in step 1, client q
stores a local copy of file Gi if one is not currently
stored.

3. Proxy server S notifies client q that client q should 
timestamp its local copy of file Gi; this notification may 
be combined with the message transmitted in step 1, if 
any. 

4. Upon receipt of the message sent in step 3, client q 
timestamps its local copy of file Gi as having just been 
pre-fetched, so that file Gi will be retained in local 
storage for a minimum of approximately t minutes 
before being deleted. 

During the period that client q retains file Gi in local storage, 
client q can respond to any request for file Gi (by user U or, 
in principle, any other user of client q) immediately and
without the assistance of proxy server S. 

The difficult task is for proxy server S, each time it 
retrieves a file F in response to a request, to identify the files 
G1 ... Gk that should be triggered by the request for file F
and pre-fetched immediately. Proxy server S employs a 
cost-benefit analysis, performing each pre-fetch whose ben- 
efit exceeds a user-determined multiple of its cost; the user 
may set the multiplier low for aggressive prefetching or high 
for conservative prefetching. These pre-fetches may be 
performed in parallel. The benefit of pre-fetching file Gi 
immediately is defined to be the expected number of seconds 
saved by such a pre-fetch, as compared to a situation where 
Gi is left to be retrieved later (either by a later pre-fetch, or 
by the user's request) if at all. The cost of pre-fetching file 
Gi immediately is defined to be the expected cost for proxy 
server S to retrieve file Gi, as determined for example by the 
network locations of server S and file Gi and by information 
provider charges, times 1 minus the probability that proxy 
server S will have to retrieve file Gi within t minutes (to 
satisfy either a later pre-fetch or the user's explicit request) 
if it is not pre-fetched now. 
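
As a minimal sketch of the decision rule just defined, the following Python functions compute the cost of a pre-fetch and compare it with the benefit. The function and parameter names are hypothetical; the benefit is taken to be in seconds and the cost in whatever unit the retrieval cost is expressed in, so the user-determined multiplier is assumed to carry the conversion between the two.

```python
def prefetch_cost(retrieval_cost, p_needed_anyway):
    """Cost of pre-fetching now: the expected retrieval cost times
    (1 - probability that the file would have to be retrieved within
    t minutes anyway if it were not pre-fetched now)."""
    return retrieval_cost * (1.0 - p_needed_anyway)


def should_prefetch(expected_seconds_saved, retrieval_cost,
                    p_needed_anyway, multiplier):
    """Pre-fetch when the benefit exceeds the user-determined
    multiple of the cost."""
    cost = prefetch_cost(retrieval_cost, p_needed_anyway)
    return expected_seconds_saved > multiplier * cost
```

With a retrieval cost of 2.0 and a multiplier of 10, a file that is 90% likely to be needed anyway and saves 4 seconds is pre-fetched, whereas the same file with only a 50% likelihood is not; lowering the multiplier to 1 makes the second pre-fetch acceptable as well.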

The above definitions of cost and benefit have some
attractive properties. For example, if users tend to retrieve
either file F1 or file F2 (say) after file F, and tend only in the
former case to subsequently retrieve file G1, then the system
will generally not pre-fetch G1 immediately after retrieving
file F: for, to the extent that the user is likely to retrieve file
F2, the cost of the pre-fetch is high, and to the extent that the
user is likely to retrieve file F1 instead, the benefit of the
pre-fetch is low, since the system can save as much or nearly
as much time by waiting until the user chooses F1 and
pre-fetching G1 only then.

The proxy server S may estimate the necessary costs and 
benefits by adhering to the following discipline: 

1. Proxy server S maintains a set of disjoint clusters of the
users in its user base, clustered according to their user 
profiles. 

2. Proxy server S maintains an initially empty set PFT of 
"pre-fetch triples" <C,F,G>, where F and G are files, 
and where C identifies either a cluster of users or the set 
of all users in the user base of proxy server S. Each 
pre-fetch triple in the set PFT is associated with several 
stored values specific to that triple. Pre-fetch triples and 
their associated values are maintained according to the 
rules in 3 and 4. 

3. Whenever a user U in the user base of proxy server S 
makes a request R2 for a file G, or a request R2 that 
triggers file G, then proxy server S takes the following 
actions: 

a. For C being the user cluster containing user U, and 
then again for C being the set of all users: 

b. For any request R0 for a file, say file F, made by user
U during the t minutes strictly prior to the request
R2:

c. If the triple <C,F,G> is not currently a member of the
set PFT, it is added to the set PFT with a count of 0,
a trigger-count of 0, a target-count of 0, a total
benefit of 0, and a timestamp whose value is the
current date and time.

d. The count of the triple <C,F,G> is increased by one. 

e. If file G was not triggered or explicitly retrieved by 
any request that user U made strictly in between 
requests R0 and R2, then the target-count of the 
triple <C,F,G> is increased by one. 

f. If request R2 was a request for file G, then the total 
benefit of triple <C,F,G> is increased either by the 
time elapsed between request R0 and request R2, or
by the expected time to retrieve file G, whichever is
less.

g. If request R2 was a request for file G, and G was
triggered or explicitly retrieved by one or more
requests that user U made strictly in between
requests R0 and R2, with R1 denoting the earliest
such request, then the total benefit of triple <C,F,G>
is decreased either by the time elapsed between
request R1 and request R2, or by the expected time
to retrieve file G, whichever is less.

4. If a user U requests a file F, then the trigger-count is 
incremented by one for each triple currently in the set 
PFT such that the triple has form <C,F,G>, where user
U is in the set or cluster identified by C.

5. The "age" of a triple <C,F,G> is defined to be the 
number of days elapsed between its timestamp and the 
current date and time. If the age of any triple <C,F,G> 
exceeds a fixed constant number of days, and also 
exceeds a fixed constant multiple of the triple's count,
then the triple may be deleted from the set PFT. 
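
The bookkeeping in rules 2 through 5 can be sketched as follows.
This is a minimal illustration rather than the disclosed
implementation: the names PrefetchTable, TripleStats and ALL_USERS,
the use of seconds for timestamps, and the per-user event history
are all assumptions introduced for the example.

```python
from dataclasses import dataclass

ALL_USERS = "ALL"  # hypothetical label standing for "the set of all users"


@dataclass
class TripleStats:
    """Values stored with one pre-fetch triple <C,F,G>."""
    timestamp: float            # creation time, in seconds
    count: int = 0
    trigger_count: int = 0
    target_count: int = 0
    total_benefit: float = 0.0


class PrefetchTable:
    def __init__(self, window_minutes, cluster_of, expected_fetch_time):
        self.window = window_minutes * 60.0             # t, in seconds
        self.cluster_of = cluster_of                    # user -> cluster id
        self.expected_fetch_time = expected_fetch_time  # file -> seconds
        self.pft = {}       # (C, F, G) -> TripleStats
        self.events = {}    # user -> [(time, file, explicit)]

    def record(self, user, file_g, now, explicit=True):
        """Process a request R2 at time `now` that requests file_g
        (explicit=True) or merely triggers it (explicit=False)."""
        history = self.events.setdefault(user, [])
        scopes = (self.cluster_of[user], ALL_USERS)
        if explicit:
            # Rule 4: a request for file F bumps trigger-count of <C,F,*>.
            for (c, f, _g), stats in self.pft.items():
                if f == file_g and c in scopes:
                    stats.trigger_count += 1
        cap = self.expected_fetch_time[file_g]
        for t0, file_f, r0_explicit in history:
            # Rule 3b: R0 is an earlier request for a file F, made during
            # the t minutes strictly prior to R2.
            if not r0_explicit or not (now - self.window <= t0 < now):
                continue
            # Times at which G was requested or triggered strictly
            # between R0 and R2.
            between = [t for t, f, _e in history
                       if f == file_g and t0 < t < now]
            for c in scopes:                                   # rule 3a
                key = (c, file_f, file_g)
                if key not in self.pft:                        # rule 3c
                    self.pft[key] = TripleStats(timestamp=now)
                stats = self.pft[key]
                stats.count += 1                               # rule 3d
                if not between:
                    stats.target_count += 1                    # rule 3e
                if explicit:
                    stats.total_benefit += min(now - t0, cap)  # rule 3f
                    if between:                                # rule 3g
                        stats.total_benefit -= min(now - min(between), cap)
        history.append((now, file_g, explicit))

    def prune(self, now, max_age_days, days_per_count):
        """Rule 5: drop triples that are both old and rarely observed."""
        for key, stats in list(self.pft.items()):
            age_days = (now - stats.timestamp) / 86400.0
            if age_days > max_age_days and age_days > days_per_count * stats.count:
                del self.pft[key]
```

In this sketch a file triggered by a request is logged with
explicit=False at the time of the triggering request, so that
rules 3e and 3g can detect that file G was already on its way.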

Proxy server S can therefore decide rapidly which files G 
should be triggered by a request for a given file F from 
a given user U, as follows.

1. Let C0 be the user cluster containing user U, and C1 be
the set of all users.

2. Server S constructs a list L of all triples <C0,F,G> such
that <C0,F,G> appears in set PFT with a count exceeding
a fixed threshold.

3. Server S adds to list L all triples <C1,F,G> such that 
<C0,F,G> does not appear on list L and <C1,F,G> 
appears in set PFT with a count exceeding another fixed 
threshold. 

4. For each triple <C,F,G> on list L:

5. Server S computes the cost of triggering file G to be
the expected cost of retrieving file G, times 1 minus the
quotient of the target-count of <C,F,G> by the
trigger-count of <C,F,G>.

6. Server S computes the benefit of triggering file G to be 
the total benefit of <C,F,G> divided by the count of 
<C,F,G>. 

7. Finally, proxy server S uses the computed cost and
benefit, as described earlier, to decide whether file G
should be triggered.

The approach to pre-fetching just described has the
advantage that all data storage and manipulation
concerning pre-fetching decisions by proxy server S is
handled locally at proxy server S. However, this
"user-based" approach does lead to duplicated storage and
effort across proxy servers, as well as incomplete data at
each individual proxy server. That is, the information
indicating what files are frequently retrieved after file F
is scattered in an uncoordinated way across numerous proxy
servers. An alternative, "file-based" approach is to store
all such information with file F itself. The difference is
as follows. In the user-based approach, a pre-fetch triple
<C,F,G> in server S's set PFT may mention any file F and
any file G on the network, but is restricted to clusters C
that are subsets of the user base of server S. By contrast,
in the file-based approach, a pre-fetch triple <C,F,G> in
server S's set PFT may mention any user cluster C and any
file G on the network, but is restricted to files F that
are stored on server S. (Note that in the file-based
approach, user clustering is network-wide, and user
clusters may include users from different proxy servers.)
When a proxy server S2 sends a request to server S to
retrieve file F for a user U, server S2 indicates in this
message the user U's user cluster C0, as well as the user
U's value for the user-determined multiplier that is used
in cost-benefit analysis. Server S can use this
information, together with all its triples in its set PFT
of the form <C0,F,G> and <C1,F,G>, where C1 is the set of
all users everywhere on the network, to determine (exactly
as in the user-based approach) which files G1 . . . Gk are
triggered by the request for file F. When server S sends
file F back to proxy server S2, it also sends this list of
files G1 . . . Gk, so that proxy server S2 can proceed to
pre-fetch files G1 . . . Gk.
The file-based approach requires some additional data 
transmission. Recall that under the user-based approach, 
server S must execute steps 3c-3g above for any ordered 
pair of requests R0 and R2 made within t minutes of each 
other by a user who employs server S as a proxy server. 
Under the file-based approach, server S must execute steps 
3c-3g above for any ordered pair of requests R0 and R2 
made within t minutes of each other, by any user on the 
network, such that R0 requests a file stored on server S. 
Therefore, when a user makes a request R2, the user's proxy 
server must send a notification of request R2 to all servers 
S such that, during the preceding t minutes (where the 
variable t may now depend on server S), the user has made 
a request R0 for a file stored on server S. This notification
need not be sent immediately, and it is generally more 
efficient for each proxy server to buffer up such notifications 
and send them periodically in groups to the appropriate 
servers. 
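
The trigger decision made by server S (list construction, then
the cost and benefit estimates for each triple) can be sketched
as below. The sketch is illustrative only. The dictionary layout,
the function names and the default thresholds are assumptions,
and because the final cost-benefit comparison is described
elsewhere in the specification, the rule used here (trigger when
the user-determined multiplier times the benefit exceeds the
cost) is an assumed stand-in.

```python
def candidates(pft, cluster, file_f, cluster_min, global_min, all_users="ALL"):
    """Build list L: cluster-level triples first, then global triples
    for files G not already covered at the cluster level."""
    chosen = {}
    for (c, f, g), stats in pft.items():
        if c == cluster and f == file_f and stats["count"] > cluster_min:
            chosen[g] = stats
    for (c, f, g), stats in pft.items():
        if (c == all_users and f == file_f and g not in chosen
                and stats["count"] > global_min):
            chosen[g] = stats
    return chosen


def cost_and_benefit(stats, expected_cost):
    """Cost: expected retrieval cost times (1 - target-count / trigger-count).
    Benefit: total benefit divided by count."""
    if stats["trigger_count"]:
        hit_rate = stats["target_count"] / stats["trigger_count"]
    else:
        hit_rate = 0.0
    cost = expected_cost * (1.0 - hit_rate)
    benefit = stats["total_benefit"] / stats["count"] if stats["count"] else 0.0
    return cost, benefit


def files_to_trigger(pft, cluster, file_f, expected_cost, multiplier,
                     cluster_min=5, global_min=20):
    """Return the files G to pre-fetch after a request for file_f."""
    selected = []
    for g, stats in candidates(pft, cluster, file_f,
                               cluster_min, global_min).items():
        cost, benefit = cost_and_benefit(stats, expected_cost[g])
        if multiplier * benefit > cost:   # assumed decision rule
            selected.append(g)
    return sorted(selected)
```

Under the file-based approach the same functions would run at
the server that stores file F, with the cluster and multiplier
supplied in the request sent by proxy server S2.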

Access And Reachability Control of Users and User-Specific 
Information 

Although users' true identities are protected by the use of 
secure mix paths, pseudonymity does not guarantee com- 
plete privacy. In particular, advertisers can in principle 
employ user-specific data to barrage users with unwanted 
solicitations. The general solution to this problem is for 
proxy server S2 to act as a representative on behalf of each 
user in its user base, permitting access to the user and the 
user's private data only in accordance with criteria that have 
been set by the user. Proxy server S2 can restrict access in 
two ways: 

1. The proxy server S2 may restrict access by third parties
to server S2's pseudonymous database of user-specific
information. When a third party such as an advertiser
sends a message to server S2 requesting the release of
user-specific information for a pseudonym P, server S2
refuses to honor the request unless the message
includes credentials for the accessor adequate to prove 
that the accessor is entitled to this information. The user 
associated with pseudonym P may at any time send 
signed control messages to proxy server S2, specifying 
the credentials or Boolean combinations of credentials 
that proxy server S2 should thenceforth consider to be 
adequate grounds for releasing a specified subset of the 
information associated with pseudonym P. Proxy server 
S2 stores these access criteria with its database record 
for pseudonym P. For example, a user might wish
proxy server S2 to release purchasing information only
to selected information providers, to charitable
organizations (that is, organizations that can provide a
government-issued credential that is issued only to 
registered charities), and to market researchers who 
have paid user U for the right to study user U's 
purchasing habits.

2. The proxy server S2 may restrict the ability of third
parties to send electronic messages to the user. When a
third party such as an advertiser attempts to send
information (such as a textual message or a request to
enter into spoken or written real-time communication)
to pseudonym P, by sending a message to proxy server
S2 requesting proxy server S2 to forward the information
to the user at pseudonym P, proxy server S2 will
refuse to honor the request, unless the message includes
credentials for the accessor adequate to meet the
requirements the user has chosen to impose, as above,
on third parties who wish to send information to the
user. If the message does include adequate credentials,
then proxy server S2 removes a single-use pseudonymous
return address envelope from its database record
for pseudonym P, and uses the envelope to send a
message containing the specified information along a
secure mix path to the user of pseudonym P. If the
envelope being used is the only envelope stored for
pseudonym P, or more generally if the supply of such
envelopes is low, proxy server S2 adds a notation to this
message before sending it, which notation indicates to
the user's local server that it should send additional
envelopes to proxy server S2 for future use.
In a more general variation, the user may instruct the
proxy server S2 to impose more complex requirements on
the granting of requests by third parties, not simply Boolean
combinations of required credentials. The user may impose
any Boolean combination of simple requirements that may
include, but are not limited to, the following:
(a.) the accessor (third party) is a particular party
(b.) the accessor has provided a particular credential
(c.) satisfying the request would involve disclosure to the
accessor of a certain fact about the user's user profile
(d.) satisfying the request would involve disclosure to the
accessor of the user's target profile interest summary
(e.) satisfying the request would involve disclosure to the
accessor of statistical summary data, which data are
computed from the user's user profile or target profile
interest summary together with the user profiles and
target profile interest summaries of at least n other users
in the user base of the proxy server
(f.) the content of the request is to send the user a target
object, and this target object has a particular attribute
(such as high reading level, or low vulgarity, or an
authenticated Parental Guidance rating from the
MPAA)
(g.) the content of the request is to send the user a target
object, and this target object has been digitally signed
with a particular private key (such as the private key
used by the National Pharmaceutical Association to
certify approved documents)
(h.) the content of the request is to send the user a target
object, and the target profile has been digitally signed
by a profile authentication agency, guaranteeing that
the target profile is a true and accurate profile of the
target object it claims to describe, with all attributes
authenticated
(i.) the content of the request is to send the user a target
object, and the target profile of this target object is
within a specified distance of a particular search profile
specified by the user
(j.) the content of the request is to send the user a target
object, and the proxy server S2, by using the user's
stored target profile interest summary, estimates the
user's likely interest in the target object to be above a
specified threshold
(k.) the accessor indicates its willingness to make a
particular payment to the user in exchange for the
fulfillment of the request
The steps required to create and maintain the user's 
access-control requirements are as follows: 

1. The user composes a Boolean combination of predi- 
cates that apply to requests; the resulting complex 
predicate should be true when applied to a request that 
the user wants proxy server S2 to honor, and false 
otherwise. The complex predicate may be encoded in 
another form, for efficiency. 

2. The complex predicate is signed with SKp, and trans- 
mitted from the user's client processor C3 to the proxy 
server S2 through the mix path enclosed in a packet that 
also contains the user's pseudonym P. 

3. The proxy server S2 receives the packet, verifies its
authenticity using PKp, and stores the access control
instructions specified in the packet as part of its
database record for pseudonym P.

The proxy server S2 enforces access control as follows:

1. The third party (accessor) transmits a request to proxy 
server S2 using the normal point-to-point connections 
provided by the network N. The request may be to 
access the target profile interest summaries associated 
with a set of pseudonyms P1 . . . Pn, or to access the
user profiles associated with a set of pseudonyms
P1 . . . Pn, or to forward a message to the users associated
with pseudonyms P1 . . . Pn. The accessor may explicitly
specify the pseudonyms P1 . . . Pn, or may ask that
P1 . . . Pn be chosen to be the set of all pseudonyms
registered with proxy server S2 that meet specified 
conditions. 

2. The proxy server S2 indexes the database record for 
each pseudonym Pi (1<=i<=n), retrieves the access
requirements provided by the user associated with Pi, 
and determines whether and how the transmitted 
request should be satisfied for Pi. If the requirements 
are satisfied, S2 proceeds with steps 3a-3c. 

3a. If the request can be satisfied but only upon payment 
of a fee, the proxy server S2 transmits a payment 
request to the accessor, and waits for the accessor to
send the payment to the proxy server S2. Proxy server
S2 retains a service fee and forwards the balance of the
payment to the user associated with pseudonym Pi, via 
an anonymous return packet that this user has provided. 

3b. If the request can be satisfied but only upon provision 
of a credential, the proxy server S2 transmits a creden- 
tial request to the accessor, and waits for the accessor 
to send the credential to the proxy server S2. 

3c. The proxy server S2 satisfies the request by disclosing 
user-specific information to the accessor, by providing 
the accessor with a set of single-use envelopes to 
communicate directly with the user, or by forwarding a 
message to the user, as requested. 

4. Proxy server S2 optionally sends a message to the 
accessor, indicating why each of the denied requests for 
P1 . . . Pn was denied, and/or indicating how many
requests were satisfied. 

5. The active and/or passive relevance feedback provided 
by any user U with respect to any target object sent by 
any path from the accessor is tabulated by the
above-described tabulating process resident on user U's client
processor C3. As described above, a summary of such
information is periodically transmitted to the proxy
server S2 to enable the proxy server S2 to update that 
user's target profile interest summary and user profile. 
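
The complex predicate composed by the user, and its evaluation
by proxy server S2 against an incoming request, can be sketched
as below. This is only an illustration: the Request fields, the
predicate constructors and the credential strings are
hypothetical, and verification of the user's signature on the
predicate is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical summary of a third-party request."""
    accessor: str
    credentials: frozenset = frozenset()
    payment_offered: float = 0.0


# Simple requirements, each a function from Request to bool.
def is_party(name):
    return lambda r: r.accessor == name


def has_credential(credential):
    return lambda r: credential in r.credentials


def pays_at_least(amount):
    return lambda r: r.payment_offered >= amount


# Boolean combinators used to build the complex predicate.
def all_of(*predicates):
    return lambda r: all(p(r) for p in predicates)


def any_of(*predicates):
    return lambda r: any(p(r) for p in predicates)


def negate(predicate):
    return lambda r: not predicate(r)


def honored_pseudonyms(request, policies):
    """Return the pseudonyms whose stored predicate accepts the request.
    `policies` maps pseudonym -> predicate, as stored by the proxy."""
    return [p for p, policy in policies.items() if policy(request)]


# Example policy: registered charities, or market researchers who pay.
policy = any_of(
    has_credential("registered-charity"),
    all_of(has_credential("market-researcher"), pays_at_least(5.0)),
)
```

Representing each requirement as a function keeps the predicate
true for requests the user wants honored and false otherwise; a
real proxy server would more likely store a serialized form of
the same expression.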
The access control criteria can be applied to solicited as 
well as unsolicited transmissions. That is, the proxy server 
can be used to protect the user from inappropriate or 
misrepresented target objects that the user may request. If 
the user requests a target object from an information server, 
but the target object turns out not to meet the access control 
criteria, then the proxy server will not permit the information 
server to transmit the target object to the user, or to charge 
the user for such transmission. For example, to guard against 
target objects whose profiles have been tampered with, the 
user may specify an access control criterion that requires the 
provider to prove the target profile's accuracy by means of 
a digital signature from a profile authentication agency. As 
another example, the parents of a child user may instruct the 
proxy server that only target objects that have been digitally 
signed by a recognized child protection organization may be 
transmitted to the user; thus, the proxy server will not let the 
user retrieve pornography, even from a rogue information 
server that is willing to provide pornography to users who 
have not supplied an adulthood credential. 
Distribution of Information with Multicast Trees 

The graphical representation of the network N presented 
in FIG. 3 shows that at least one of the data communications 
links can be eliminated, as shown in FIG. 4, while still 
enabling the network N to transmit messages among all the 
servers A-D. By elimination, we mean that the link is 
unused in the logical design of the network, rather than a 
physical disconnection of the link. The graphs that result 
when all redundant data communications links are eliminated
are termed "trees" or "connected acyclic graphs." A
graph where a message could be transmitted by a server 
through other servers and then return to the transmitting 
server over a different originating data communications link 
is termed a "cycle." A tree is thus an acyclic graph whose 
edges (links) connect a set of graph "nodes" (servers). The 
tree can be used to efficiently broadcast any data file to 
selected servers in a set of interconnected servers. 

The tree structure is attractive in a communications net- 
work because much information distribution is multicast in 
nature — that is, a piece of information available at a single 
source must be distributed to a multiplicity of points where 
the information can be accessed. This technique is widely 
known: for example, "FAX trees" are in common use in 
political organizations, and multicast trees are widely used 
in distribution of multimedia data in the Internet; for 
example, see "Scaleable Feedback Control for Multicast 
Video Distribution in the Internet," (Jean-Chrysostome 
Bolot, Thierry Turletti, & Ian Wakeman, Computer
Communication Review, Vol. 24, #4, October, '94, Proceedings
of SIGCOMM'94, pp. 58-67) or "An Architecture For 
Wide-Area Multicast Routing," (Stephen Deering, Deborah 
Estrin, Dino Farinacci, Van Jacobson, Ching-Gung Liu, & 
Liming Wei, Computer Communication Review, Vol. 24, #4, 
October, '94, Proceedings of SIGCOMM'94, pp. 126-135). 
While there are many possible trees that can be overlaid on 
a graph representation of a network, both the nature of the 
networks (e.g., the cost of transmitting data over a link) and 
their use (for example, certain nodes may exhibit more 
frequent intercommunication) can make one choice of tree 
better than another for use as a multicast tree. One of the 
most difficult problems in practical network design is the 
construction of "good" multicast trees, that is, tree choices 
which exhibit low cost (due to data not traversing links 
unnecessarily) and good performance (due to data frequently 
being close to where it is needed).

Constructing a Multicast Tree

Algorithms for constructing multicast trees have either
been ad hoc, as in the case of the Deering et al. Internet
multicast tree, which adds clients as they request service by
grafting them into the existing tree, or have relied on the
construction of a minimum cost spanning tree. A distributed
algorithm for
creating a spanning tree (defined as a tree that connects, or 
"spans," all nodes of the graph) on a set of Ethernet bridges 
was developed by Radia Perlman ("Interconnections: 
Bridges and Routers," Radia Perlman, Addison-Wesley, 
1992). Creating a minimal-cost spanning tree for a graph
depends on having a cost model for the arcs of the graph 
(corresponding to communications links in the communica- 
tions network). In the case of Ethernet bridges, the default 
cost (more complicated costing models for path costs are 
discussed on pp. 72-73 of Perlman) is calculated as a simple 
distance measure to the root; thus the spanning tree mini- 
mizes the cost to the root by first electing a unique root and 
then constructing a spanning tree based on the distances 
from the root. In this algorithm, the root is elected by 
recourse to a numeric ID contained in "configuration
messages": the server whose ID has minimum numeric value is
chosen as the root. Several problems exist with this algo- 
rithm in general. First, the method of using an ID does not 
necessarily select the best root for the nodes interconnected 
in the tree. Second, the cost model is simplistic. 

We first show how to use the similarity-based methods 
described above to select the servers most interested in a 
group of target objects, herein termed "core servers" for that 
group. Next we show how to construct an unrooted multicast 
tree that can be used to broadcast files to these core servers. 
Finally, we show how files corresponding to target objects 
are actually broadcast through the multicast tree at the 
initiative of a client, and how these files are later retrieved 
from the core servers when clients request them. 

Since the choice of core servers to distribute a file to 
depends on the set of users who are likely to retrieve the file 
(that is, the set of users who are likely to be interested in the 
corresponding target object), a separate set of core servers 
and hence a separate multicast tree may be used for each 
topical group of target objects. Throughout the description 
below, servers may communicate among themselves 
through any path over which messages can travel; the goal 
of each multicast tree is to optimize the multicast distribu- 
tion of files corresponding to target objects of the corre- 
sponding topic. Note that this problem is completely distinct 
from selecting a multiplicity of spanning trees for the 
complete set of interconnected nodes as disclosed by Sin- 
coskie in U.S. Pat. No. 4,706,080 and the publication titled 
"Extended Bridge Algorithms for Large Networks" by W. D. 
Sincoskie and C. J. Cotton, published January 1988 in IEEE 
Network on pages 16-24. The trees in this disclosure are 
intentionally designed to interconnect a selected subset of 
nodes in the system, and are successful to the degree that this 
subset is relatively small. 
Multicast Tree Construction Procedure 

A set of topical multicast trees for a set of homogenous 
target objects may be constructed or reconstructed at any 
time, as follows. The set of target objects is grouped into a 
fixed number of topical clusters C1 . . . Cp with the methods
described above, for example, by choosing C1 . . . Cp to be
the result of a k-means clustering of the set of target objects,
or alternatively a covering set of low-level clusters from a
hierarchical cluster tree of these target objects. A multicast
tree MT(C) is then constructed from each cluster C in
C1 . . . Cp, by the following procedure:

1. Given a set of proxy servers, S1 . . . Sn, and a topical
cluster C. It is assumed that a general multicast tree MT_full
that contains all the proxy servers S1 . . . Sn has previously
been constructed by well-known methods.

2. Each pair <Si, C> is associated with a weight, w(Si, C),
which is intended to covary with the expected number of
users in the user base of proxy server Si who will
subsequently access a target object from cluster C. This weight
is computed by proxy server Si in any of several ways, all of
which make use of the similarity measurement computation
described herein.

One variation makes use of the following steps: (a) Proxy
server Si randomly selects a target object T from cluster C.
(b) For each pseudonym in its local database, with
associated user U, proxy server Si applies the techniques
disclosed above to user U's stored user profile and target
profile interest summary in order to estimate the interest
w(U, T) that user U has in the selected target object T. The
aggregate interest w(Si, T) that the user base of proxy server
Si has in the target object T is defined to be the sum of these
interest values w(U, T). Alternatively, w(Si, T) may be defined
to be the sum of values s(w(U, T)) over all U in the user base.
Here s(*) is a sigmoidal function that is close to 0 for small
arguments and close to a constant P_max for large arguments;
thus s(w(U, T)) estimates the probability that user U will
access target object T, which probability is assumed to be
independent of the probability that any other user will access
target object T. In a variation, w(Si, T) is made to estimate
the probability that at least one user from the user base of Si
will access target object T: then w(Si, T) may be defined as
the maximum of values w(U, T), or of 1 minus the product
over the users U of the quantity (1-s(w(U, T))). (c) Proxy
server Si repeats steps (a)-(b) for several target objects T
selected randomly from cluster C, and averages the several
values of w(Si, T) thereby computed in step (b) to determine
the desired quantity w(Si, C), which quantity represents the
expected aggregate interest by the user base of proxy server
Si in the target objects of cluster C.

In another variation, where target profile interest
summaries are embodied as search profile sets, the following
procedure is followed to compute w(Si, C): (a) For each
search profile P_s in the locally stored search profile set of
any user in the user base of proxy server Si, proxy server Si
computes the distance d(P_s, P_c) between the search profile
and the cluster profile P_c of cluster C. (b) w(Si, C) is chosen
to be the maximum value of (-d(P_s, P_c)/r) across all such
search profiles P_s, where r is computed as an affine function
of the cluster diameter of cluster C. The slope and/or
intercept of this affine function are chosen to be smaller
(thereby increasing w(Si, C)) for servers Si for which the
target object provider wishes to improve performance, as
may be the case if the users in the user base of proxy server
Si pay a premium for improved performance, or if
performance at Si will otherwise be unacceptably low due to slow
network connections.

In another variation, the proxy server Si is modified so
that it maintains not only target profile interest summaries
for each user in its user base, but also a single aggregate
target profile interest summary for the entire user base. This
aggregate target profile interest summary is determined in
the usual way from relevance feedback, but the relevance
feedback on a target object, in this case, is considered to be
the frequency with which users in the user base retrieved the
target object when it was new. Whenever a user retrieves a
target object by means of a request to proxy server Si, the
aggregate target profile interest summary for proxy server Si
is updated. In this variation, w(Si, C) is estimated by the
following steps:

(a) Proxy server Si randomly selects a target object T from
cluster C.

(b) Proxy server Si applies the techniques disclosed above
to its stored aggregate target profile interest summary in
order to estimate the aggregate interest w(Si, T) that the
aggregated user base had in the selected target object T,
when new; this may be interpreted as an estimate of the
likelihood that at least one member of the user base will
retrieve a new target object similar to T.

(c) Proxy server Si repeats steps (a)-(b) for several target
objects T selected randomly from cluster C, and averages
the several values of w(Si, T) thereby computed in
step (b) to determine the desired quantity w(Si, C),
which quantity represents the expected aggregate interest
by the user base of proxy server Si in the target
objects of cluster C.

3. Those servers Si from among S1 . . . Sn with the
greatest weights w(Si, C) are designated "core servers" for
cluster C. In one variation, where it is desired to select a
fixed number of core servers, those servers Si with the
greatest values of w(Si, C) are selected. In another variation,
the value of w(Si, C) for each server Si is compared against
a fixed threshold w_min, and those servers Si such that w(Si,
C) equals or exceeds w_min are selected as core servers. If
cluster C represents a narrow and specialized set of target
objects, as often happens when the clusters C1 . . . Cp are
numerous, it is usually adequate to select only a small
number of core servers for cluster C, thereby obtaining
substantial advantages in computational efficiency in steps 4-5
below.

4. A complete graph G(C) is constructed whose vertices
are the designated core servers for cluster C. For each pair
of core servers, the cost of transmitting a message between
those core servers along the cheapest path is estimated, and
the weight of the edge connecting those core servers is taken
to be this cost. The cost is determined as a suitable function
of average transmission charges, average transmission delay,
and worst-case or near-worst-case transmission delay.

5. The multicast tree MT(C) is computed by standard
methods to be the minimum spanning tree (or a
near-minimum spanning tree) for G(C), where the weight of an
edge between two core servers is taken to be the cost of
transmitting a message between those two core servers. Note
that MT(C) does not contain as vertices all proxy servers
S1 . . . Sn, but only the core servers for cluster C.

6. A message M is formed describing the cluster profile
for cluster C, the core servers for cluster C and the topology
of the multicast tree MT(C) constructed on those core
servers. Message M is broadcast to all proxy servers S1 . . .
Sn by means of the general multicast tree MT_full. Each proxy
server Si, upon receipt of message M, extracts the cluster
profile of cluster C, and stores it on a local storage device,
together with certain other information that it determines
from message M, as follows. If proxy server Si is named in
message M as a core server for cluster C, then proxy server
Si extracts and stores the subtree of MT(C) induced by all
core servers whose path distance from Si in the graph MT(C)
is less than or equal to d, where d is a constant positive
integer (usually from 1 to 3). If message M does not name
proxy server Si as a core server for MT(C), then proxy server
Si extracts and stores a list of one or more nearby core
servers that can be inexpensively contacted by proxy server
Si over virtual point-to-point links.

In the network of FIG. 3, to illustrate the use of trees, as
applied to the system of the present invention, consider the
following simple example where it is assumed that client r
provides on-line information for the network, such as an
electronic newspaper. This information can be structured by
client r into a prearranged form, comprising a number of
files, each of which is associated with a different target
object. In the case of an electronic newspaper, the files can
contain textual representations of stock prices, weather
forecasts, editorials, etc. The system determines likely
demand for the target objects associated with these files in
order to optimize the distribution of the files through the
network N of interconnected clients p-s and proxy servers
A-D. Assume that cluster C consists of text articles relating
to the aerospace industry; further assume that the target
profile interest summaries stored at proxy servers A and B
for the users at clients p and r indicate that these users are
strongly interested in such articles. Then the proxy servers A
and B are selected as core servers for the multicast tree
MT(C). The multicast tree MT(C) is then computed to
consist of the core servers, A and B, connected by an edge
that represents the least costly virtual point-to-point link
between A and B (either the direct path A-B or the indirect
path A-C-B, depending on the cost).
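
Steps 2 through 5 of the construction procedure can be sketched
as follows. The sketch is a hedged illustration under stated
assumptions: the logistic form chosen for the sigmoidal function
s, the threshold-based selection of core servers, the mode names,
and the use of Prim's algorithm as the "standard method" for the
minimum spanning tree are all choices made for the example and
are not prescribed by the text.

```python
import math


def sigmoid(x, p_max=1.0):
    """One possible s(*): near 0 for small x, near p_max for large x."""
    return p_max / (1.0 + math.exp(-x))


def object_weight(interests, mode="sum"):
    """w(Si, T) computed from the per-user interests w(U, T)."""
    if mode == "sum":
        return sum(interests)
    probabilities = [sigmoid(w) for w in interests]
    if mode == "expected_accesses":
        return sum(probabilities)
    if mode == "at_least_one":
        # 1 minus the product over users of (1 - s(w(U, T))).
        return 1.0 - math.prod(1.0 - p for p in probabilities)
    raise ValueError(mode)


def cluster_weight(sampled_interests, mode="sum"):
    """w(Si, C): average of w(Si, T) over sampled target objects T."""
    values = [object_weight(ws, mode) for ws in sampled_interests]
    return sum(values) / len(values)


def core_servers(weights, w_min):
    """Step 3 (threshold variation): servers with w(Si, C) >= w_min."""
    return sorted(s for s, w in weights.items() if w >= w_min)


def minimum_spanning_tree(nodes, cost):
    """Steps 4-5: Prim's algorithm on the complete graph over `nodes`,
    where cost(u, v) is the estimated cheapest-path cost."""
    nodes = list(nodes)
    if not nodes:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: cost(*edge),
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges
```

In the example of FIG. 3, core_servers would return servers A
and B, and the resulting tree would contain the single edge
between them.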
Global Requests to Multicast Trees 

One type of message that may be transmitted to any proxy 
server S is termed a "global request message." Such a 
message M triggers the broadcast of an embedded request R 
to all core servers in a multicast tree MT(C). The content of
request R and the identity of cluster C are included in the
message M, as is a field indicating that message M is a
global request message. In addition, the message M contains
a field S_last which is unspecified except under certain
circumstances described below, when it names a specific core
server. A global request message M may be transmitted to
proxy server S by a user registered with proxy server S,
which transmission may take place along a pseudonymous
mix path, or it may be transmitted to proxy server S from
another proxy server, along a virtual point-to-point
connection.

When a proxy server S receives a message M that is
marked as a global request message, it acts as follows:

1. If proxy server S is not a core server for topic C, it
retrieves its locally stored list of nearby core servers for
topic C, selects from this list a nearby core server S', and
transmits a copy of message M over a virtual point-to-point
connection to core server S'. If this transmission fails,
proxy server S repeats the procedure with other core servers
on its list.

2. If proxy server S is a core server for topic C, it
executes the following steps:

(a) Act on the request R that is embedded in message M.

(b) Set S_curr to be S.

(c) Retrieve the locally stored subtree of MT(C), and
extract from it a list L of all core servers that are
directly linked to S_curr in this subtree.

(d) If the message M specifies a value for S_last and
S_last appears on the list L, remove S_last from the list L.
Note that list L may be empty before this step, or may
become empty as a result of this step.

(e) For each server Si in list L, transmit a copy of
message M from server S to server Si over a virtual
point-to-point connection, where the S_last field of the
copy of message M has been altered to S_curr. If Si cannot
be reached in a reasonable amount of time by any virtual
point-to-point connection (for example, server Si is
broken), recurse to step (c) above with S_last bound to
S_curr and S_curr bound to Si for the duration of the
recursion.

When server S' in step 1 or a server Si in step 2(e) receives
a copy of the global request message M, it acts according to
exactly the same steps. As a result, all core servers eventually
receive a copy of global request message M and act on
the embedded request R, unless some core servers cannot be
reached. Even if a core server is unreachable, step (e)
ensures that the broadcast can continue to other core servers
in most circumstances, provided that d>1; higher values of
d provide additional insurance against unreachable core
servers.
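
The broadcast procedure for core servers, steps 2(a) through 2(e) above, can be simulated in a few lines. This is a minimal sketch, not the patented implementation: the adjacency dictionary `tree`, the `unreachable` set, and the function names are hypothetical, each server is assumed to hold the whole tree rather than a local subtree, and the sketch assumes that when the sender routes around a broken server it treats that server as S_curr and its own previous S_curr as S_last.

```python
def broadcast(tree, start, unreachable=frozenset()):
    """Simulate a global request delivered to core server `start`.

    `tree` maps each core server to the servers directly linked to it
    in MT(C). Returns the set of servers that acted on request R.
    """
    acted = set()

    def receive(server, s_last):
        acted.add(server)                # step (a): act on request R
        forward(server, server, s_last)  # step (b): S_curr = S

    def forward(sender, s_curr, s_last):
        # steps (c)-(d): servers linked to S_curr, excluding S_last
        targets = [n for n in tree[s_curr] if n != s_last]
        for nxt in targets:              # step (e)
            if nxt in unreachable:
                # recurse to step (c) on behalf of the broken server
                forward(sender, nxt, s_curr)
            else:
                # the copy carries S_last = S_curr
                receive(nxt, s_curr)

    receive(start, None)
    return acted

# Hypothetical tree: A-B, B-C, B-D, D-E, with core server B broken.
tree = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"],
        "D": ["B", "E"], "E": ["D"]}
reached = broadcast(tree, "A", unreachable={"B"})
```

In this example server A cannot reach B, so it sends copies directly to B's other neighbors C and D, and the request still arrives at E through D.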

Multicasting Files

The system for customized electronic identification of desirable objects executes the following steps in order to introduce a new target object into the system. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client processor q, as illustrated in FIG. 3, or an automatic software process resident on a client or server processor q.
1. Processor q forms a signed request R, which asks the receiver to store a copy of a file F on its local storage device. File F, which is maintained by client q on storage at client q or on storage accessible by client q over the network, contains the informational content of or an identifying description of a target object, as described above. The request R also includes an address at which entity E may be contacted (possibly a pseudonymous address at some proxy server D), and asks the receiver to store the fact that file F is maintained by an entity at said address.
2. Processor q embeds request R in a message M1, which it pseudonymously transmits to the entity E's proxy server D as described above. Message M1 instructs proxy server D to broadcast request R along an appropriate multicast tree.
3. Upon receipt of message M1, proxy server D examines the doubly embedded file F and computes a target profile P for the corresponding target object. It compares the target profile P to each of the cluster profiles for topical clusters C1 . . . Cp described above, and chooses Ck to be the cluster with the smallest similarity distance to profile P.
4. Proxy server D sends itself a global request message M instructing itself to broadcast request R along the topical multicast tree MT(Ck).
5. Proxy server D notifies entity E through a pseudonymous communication that file F has been multicast along the topical multicast tree for cluster Ck.
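
Step 3 above, choosing the topical cluster Ck closest to target profile P, might be sketched as follows. The sketch makes two assumptions that are not in the text at this point: profiles are plain numeric vectors, and Euclidean distance stands in for the similarity distance defined elsewhere in the specification. The function and cluster names are hypothetical.

```python
import math

def nearest_cluster(profile, cluster_profiles):
    """Return the name of the cluster whose profile is closest to `profile`.

    `cluster_profiles` maps a cluster name to its profile vector.
    Euclidean distance is a stand-in for the similarity distance.
    """
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    return min(cluster_profiles,
               key=lambda name: distance(profile, cluster_profiles[name]))

# Hypothetical two-attribute profiles for two topical clusters.
clusters = {"aerospace": [1.0, 0.0], "sports": [0.0, 1.0]}
chosen = nearest_cluster([0.9, 0.1], clusters)
```

Proxy server D would then address its global request message to the multicast tree of the chosen cluster.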

As a result of the procedure that server D and other
servers follow for acting on global request messages, step 4
eventually causes all core servers for topic Ck to act on
request R and therefore store a local copy of file F. In order
to make room for file F on its local storage device, a core
server Si may have to delete a less useful file. There are
several ways to choose a file to delete. One option, well
known in the art, is for Si to choose to delete the least
recently accessed file. In another variation, Si deletes a file
that it believes few users will access. In this variation,
whenever a server Si stores a copy of a file F, it also
computes and stores the weight w(Si, C_F), where C_F is a
cluster consisting of the single target object associated with
file F. Then, when server Si needs to delete a file, it chooses
to delete the file F with the lowest weight w(Si, C_F). To
reflect the fact that files are accessed less as they age, server
Si periodically multiplies its stored value of w(Si, C_F) by a
decay factor, such as 0.95, for each file F that it then stores.
Alternatively, instead of using a decay factor, server Si may
periodically recompute aggregate interest w(Si, C_F) for each
file F that it stores; the aggregate interest changes over time
because target objects typically have an age attribute that the
system considers in estimating user interest, as described
above.
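
The weight-based deletion policy with a decay factor can be sketched as a small cache. This is an illustrative sketch only: the class name `CoreServerCache`, the fixed `capacity` counted in files, and the way weights are supplied by the caller are all assumptions; the specification computes w(Si, C_F) from aggregate user interest, which is not modeled here.

```python
class CoreServerCache:
    """Toy model of a core server's file store with weight-based eviction."""

    def __init__(self, capacity, decay=0.95):
        self.capacity = capacity   # hypothetical limit, counted in files
        self.decay = decay         # decay factor, e.g. 0.95
        self.weights = {}          # file name -> stored weight w(Si, C_F)

    def age(self):
        """Periodic step: multiply every stored weight by the decay factor."""
        for name in self.weights:
            self.weights[name] *= self.decay

    def store(self, name, weight):
        """Store a file; return the name of the evicted file, if any."""
        evicted = None
        if name not in self.weights and len(self.weights) >= self.capacity:
            # delete the file with the lowest stored weight
            evicted = min(self.weights, key=self.weights.get)
            del self.weights[evicted]
        self.weights[name] = weight
        return evicted

cache = CoreServerCache(capacity=2)
cache.store("a", 1.0)
cache.age()                        # weight of "a" decays to 0.95
cache.store("b", 0.96)
evicted = cache.store("c", 0.5)    # "a" now has the lowest weight
```

The older file "a" is deleted even though it started with the highest weight, which is the effect the decay factor is meant to produce.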

If entity E later wishes to remove file F from the network, 
for example because it has just multicast an updated version, 
it pseudonymously transmits a digitally signed global 
request message to proxy server D, requesting all proxy 
servers in the multicast tree MT(Ck) to delete any local copy 
of file F that they may be storing. 
Queries to Multicast Trees 

In addition to global request messages, another type of message that may be transmitted to any proxy server S is termed a "query message." When transmitted to a proxy server, a query message causes a reply to be sent to the sender of the message; this reply will contain an answer to a given query Q if any of the servers in a given multicast tree MT(C) are able to answer it, and will otherwise indicate that no answer is available. The query and the cluster C are named in the query message. In addition, the query message contains a field S_last which is unspecified except under certain circumstances described below, when it names a specific core server. When a proxy server S receives a message M that is marked as a query message, it acts as follows:
1. Proxy server S sets A_r to be the return address for the client or server that transmitted message M to server S. A_r may be either a network address or a pseudonymous address.
2. If proxy server S is not a core server for cluster C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and transmits a copy of the locate message M over a virtual point-to-point connection to core server S'. If this transmission fails, proxy server S repeats the procedure with other core servers on its list. Upon receiving a reply, it forwards this reply to address A_r.
3. If proxy server S is a core server for cluster C, and it is able to answer query Q using locally stored information, then it transmits a "positive" reply to A_r containing the answer.
4. If proxy server S is a core server for topic C, but it is unable to answer query Q using locally stored information, then it carries out a parallel depth-first search by executing the following steps:
(a) Set L to be the empty list.
(b) Retrieve the locally stored subtree of MT(C). For each server Si directly linked to S in this subtree, other than S_last (if specified), add the ordered pair (Si, S) to the list L.
(c) If L is empty, transmit a negative reply to address A_r saying that server S cannot locate an answer to query Q, and terminate the execution of step 4; otherwise proceed to step (d).
(d) Select a list L1 of one or more server pairs (Ai, Bi) from the list L. For each server pair (Ai, Bi) on the list L1, form a locate message M(Ai, Bi), which is a copy of message M whose S_last field has been modified to specify Bi, and transmit this message M(Ai, Bi) to server Ai over a virtual point-to-point connection.
(e) For each reply received (by S) to a message sent in step (d), act as follows:
(i) If a "positive" reply arrives to a locate message M(Ai, Bi), then forward this reply to A_r and terminate step 4 immediately.
(ii) If a "negative" reply arrives to a locate message M(Ai, Bi), then remove the pair (Ai, Bi) from the list L1.
(iii) If the message M(Ai, Bi) could not be successfully delivered to Ai, then remove the pair (Ai, Bi) from the list L1, and add the pair (Ci, Ai) to the list L for each Ci other than Bi that is directly linked to Ai in the locally stored subtree of MT(C).
(f) Once L1 no longer contains any pair (Ai, Bi) for which a message M(Ai, Bi) has been sent, or after a fixed period of time has elapsed, return to step (c).

Retrieving Files from a Multicast Tree

When a processor q in the network wishes to retrieve the file associated with a given target object, it executes the following steps. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client q, as illustrated in FIG. 3, or an automatic software process resident on a client or server processor q.
1. Processor q forms a query Q that asks whether the recipient (a core server for cluster C) still stores a file F that was previously multicast to the multicast tree MT(C); if so, the recipient server should reply with its own server name. Note that processor q must already know the name of file F and the identity of cluster C; typically, this information is provided to entity E by a service such as the news clipping service or browsing system described below, which must identify files to the user by (name, multicast topic) pair.
2. Processor q forms a query message M that poses query Q to the multicast tree MT(C).
3. Processor q pseudonymously transmits message M to the user's proxy server D, as described above.
4. Processor q receives a response M2 to message M.
5. If the response M2 is "positive," that is, it names a server S that still stores file F, then processor q pseudonymously instructs the user's proxy server D to retrieve file F from server S. If the retrieval fails because server S has deleted file F since it answered the query, then client q returns to step 1.
6. If the response M2 is "negative," that is, it indicates that no server in MT(C) still stores file F, then processor q forms a query Q that asks the recipient for the address A of the entity that maintains file F; this entity will ordinarily maintain a copy of file F indefinitely. All core servers in MT(C) ordinarily retain this information (unless instructed to delete it by the maintaining entity), even if they delete file F for space reasons. Therefore, processor q should receive a response providing address A, whereupon processor q pseudonymously instructs the user's proxy server D to retrieve file F from address A.

When multiple versions of a file F exist on local servers throughout the data communication network N but are not marked as alternate versions of the same file, the system's ability to rapidly locate files similar to F (by treating them as target objects and applying the methods disclosed in "Searching for Target Objects" above) makes it possible to find all the alternate versions, even if they are stored remotely. These related data files may then be reconciled by any method. In a simple instantiation, all versions of the data file would be replaced with the version that had the latest date or version number. In another instantiation, each version would be automatically annotated with references or pointers to the other versions.

NEWS CLIPPING SERVICE

The system for customized electronic identification of desirable objects of the present invention can be used in the electronic media system of FIG. 1 to implement an automatic news clipping service which learns to select (filter) news articles to match a user's interests, based solely on which articles the user chooses to read. The system for customized electronic identification of desirable objects generates a target profile for each article that enters the electronic media system, based on the relative frequency of occurrence of the words contained in the article. The system for customized electronic identification of desirable objects also generates a search profile set for each user, as a function of the target profiles of the articles the user has accessed and the relevance feedback the user has provided on these articles. As new articles are received for storage on the mass storage systems of the information servers, the system for customized electronic identification of desirable objects generates their target profiles. The generated target profiles are later compared to the search profiles in the users' search profile sets, and those new articles whose target profiles are closest (most similar) to the closest search profile in a user's search profile set are identified to that user for possible reading. The computer program providing the articles to the user monitors how much the user reads (the number of screens of data and the number of minutes spent reading), and adjusts the search profiles in the user's search profile set to more closely match what the user apparently prefers to read. The details of the method used by this system are disclosed in flow diagram form in FIG. 5. This method requires selecting a specific method of calculating user-specific search profile sets, of measuring similarity between two profiles, and of updating a user's search profile set (or more generally target profile interest summary) based on what the user read, and the examples disclosed herein are examples of the many possible implementations that can be used and should not be construed to limit the scope of the system.

Initialize Users' Search Profile Sets

The news clipping service instantiates target profile interest summaries as search profile sets, so that a set of high-interest search profiles is stored for each user. The search profiles associated with a given user change over time. As in any application involving search profiles, they can be initially determined for a new user (or explicitly altered by an existing user) by any of a number of procedures, including the following preferred methods: (1) asking the user to specify search profiles directly by giving keywords and/or numeric attributes, (2) using copies of the profiles of target objects or target clusters that the user indicates are representative of his or her interest, (3) using a standard set of search profiles copied or otherwise determined from the search profile sets of people who are demographically similar to the user.

Retrieve New Articles from Article Source

Articles are available on-line from a wide variety of sources. In the preferred embodiment, one would use the current day's news as supplied by a news source, such as the AP or Reuters news wire. These news articles are input to the electronic media system by being loaded into the mass storage system SS4 of an information server S4. The article profile module 201 of the system for customized electronic identification of desirable objects can reside on the information server S4 and operates pursuant to the steps illustrated in the flow diagram of FIG. 5, where, as each article is received at step 501 by the information server S4, the article profile module 201 at step 502 generates a target profile for the article and stores the target profile in an article indexing memory (typically part of mass storage system SS4) for later use in selectively delivering articles to users. This method is equally useful for selecting which articles to read from electronic news groups and electronic bulletin boards, and can be used as part of a system for screening and organizing electronic mail ("e-mail").

Calculate Article Profiles

A target profile is computed for each new article, as described earlier. The most important attribute of the target profile is a textual attribute that stands for the entire text of the article. This textual attribute is represented as described earlier, as a vector of numbers, which numbers in the preferred embodiment include the relative frequencies (TF/IDF scores) of word occurrences in this article relative to other comparable articles. The server must count the frequency of occurrence of each word in the article in order to compute the TF/IDF scores.

These news articles are then hierarchically clustered in a hierarchical cluster tree at step 503, which serves as a decision tree for determining which news articles are closest to the user's interest. The resulting clusters can be viewed as a tree in which the top of the tree includes all target objects and branches further down the tree represent divisions of the set of target objects into successively smaller subclusters of target objects. Each cluster has a cluster profile, so that at each node of the tree, the average target profile (centroid) of all target objects stored in the subtree rooted at that node is stored. This average of target profiles is computed over the representation of target profiles as vectors of numeric attributes, as described above.

Compare Current Articles' Target Profiles to a User's Search Profiles

The process by which a user employs this apparatus to retrieve news articles of interest is illustrated in flow diagram form in FIG. 11. At step 1101, the user logs into the data communication network N via their client processor C1 and activates the news reading program. This is accomplished by the user establishing a pseudonymous data communications connection as described above to a proxy server S2, which provides front-end access to the data communication network N. The proxy server S2 maintains a list of authorized pseudonyms and their corresponding public keys and provides access and billing control. The user has a search profile set stored in the local data storage medium on the proxy server S2. When the user requests access to "news" at step 1102, the profile matching module 203 resident on proxy server S2 sequentially considers each search profile p_k from the user's search profile set to determine which news articles are most likely of interest to the user. The news articles were automatically clustered into a hierarchical cluster tree at an earlier step so that the determination can be made rapidly for each user. The hierarchical cluster tree serves as a decision tree for determining which articles' target profiles are most similar to search profile p_k: the search for relevant articles begins at the top of the tree, and at each level of the tree the branch or branches are selected which have cluster profiles closest to p_k. This process is recursively executed until the leaves of the tree are reached, identifying individual articles of interest to the user, as described in the section "Searching for Target Objects" above.

A variation on this process exploits the fact that many users have similar interests. Rather than carry out steps 5-9 of the above process separately for each search profile of each user, it is possible to achieve added efficiency by carrying out these steps only once for each group of similar search profiles, thereby satisfying many users' needs at once. In this variation, the system begins by non-hierarchically clustering all the search profiles in the search profile sets of a large number of users. For each cluster k of search profiles, with cluster profile p_k, it uses the method described in the section "Searching for Target Objects" to locate articles with target profiles similar to p_k. Each located article is then identified as of interest to each user who has a search profile represented in cluster k of search profiles.

Notice that the above variation attempts to match clusters of search profiles with similar clusters of articles. Since this is a symmetrical problem, it may instead be given a symmetrical solution, as the following more general variation shows. At some point before the matching process commences, all the news articles to be considered are clustered into a hierarchical tree, termed the "target profile cluster tree," and the search profiles of all users to be considered are clustered into a second hierarchical tree, termed the "search profile cluster tree." The following steps serve to find all matches between individual target profiles from any target profile cluster tree and individual search profiles from any search profile cluster tree:
1. For each child subtree S of the root of the search profile cluster tree (or, let S be the entire search profile cluster tree if it contains only one search profile):
2. Compute the cluster profile P_S to be the average of all search profiles in subtree S
3. For each subcluster (child subtree) T of the root of the target profile cluster tree (or, let T be the entire target profile cluster tree if it contains only one target profile):
4. Compute the cluster profile P_T to be the average of all target profiles in subtree T
5. Calculate d(P_S, P_T), the distance between P_S and P_T
6. If d(P_S, P_T) < t, a threshold,
7. If S contains only one search profile and T contains only one target profile, declare a match between that search profile and that target profile,
8. otherwise recurse to step 1 to find all matches between search profiles in tree S and target profiles in tree T.

The threshold used in step 6 is typically an affine function or other function of the greater of the cluster variances (or cluster diameters) of S and T. Whenever a match is declared between a search profile and a target profile, the target object that contributed the target profile is identified as being of interest to the user who contributed the search profile. Notice that the process can be applied even when the set of users to be considered or the set of target objects to be considered is very small. In the case of a single user, the process reduces to the method given for identifying articles of interest to a single user. In the case of a single target object, the process constitutes a method for identifying users to whom that target object is of interest.

Present List of Articles to User

Once the profile correlation step is completed for a selected user or group of users, at step 1104 the profile processing module 203 stores a list of the identified articles for presentation to each user. At a user's request, the profile processing system 203 retrieves the generated list of relevant articles and presents this list of titles of the selected articles to the user, who can then select at step 1105 any article for viewing. (If no titles are available, then the first sentence(s) of each article can be used.) The list of article titles is sorted according to the degree of similarity of the article's target profile to the most similar search profile in the user's search profile set. The resulting sorted list is either transmitted in real time to the user client processor C1, if the user is present at their client processor C1, or can be transmitted to a user's mailbox, resident on the user's client processor C1 or stored within the server S2 for later retrieval by the user; other methods of transmission include facsimile transmission of the printed list or telephone transmission by means of a text-to-speech system. The user can then transmit a request by computer, facsimile, or telephone to indicate which of the identified articles the user wishes to review, if any. The user can still access all articles in any information server S4 to which the user has authorized access; however, those lower on the generated list are simply further from the user's interests, as determined by the user's search profile set. The server S2 retrieves the article from the local data storage medium or from an information server S4 and presents the article one screen at a time to the user's client processor C1. The user can at any time select another article for reading or exit the process.

Monitor Which Articles Are Read

The user's search profile set generator 202 at step 1107 monitors which articles the user reads, keeping track of how many pages of text are viewed by the user, how much time is spent viewing the article, and whether all pages of the article were viewed. This information can be combined to measure the depth of the user's interest in the article, yielding a passive relevance feedback score, as described earlier. Although the exact details depend on the length and nature of the articles being searched, a typical formula might be:
   measure of article attractiveness = 0.2 if the second page is accessed + 0.2 if all pages are accessed + 0.2 if more than 30 seconds was spent on the article + 0.2 if more than one minute was spent on the article + 0.2 if the minutes spent in the article are greater than half the number of pages.

The computed measure of article attractiveness can then be used as a weighting function to adjust the user's search profile set to thereby more accurately reflect the user's dynamically changing interests.

Update User Profiles

Updating of a user's generated search profile set can be done at step 1108 using the method described in U.S. patent application Ser. No. 08/346,425. When an article is read, the server S2 shifts each search profile in the set slightly in the direction of the target profiles of those nearby articles for which the computed measure of article attractiveness was high. Given a search profile with attributes u_ik from a user's search profile set, and a set of J articles available with attributes d_jk (assumed correct for now), where i indexes users, j indexes articles, and k indexes attributes, user i would be predicted to pick a set of P distinct articles to minimize the sum of d(u_i, d_j) over the chosen articles j. The user's desired attributes u_ik and an article's attributes d_jk would be some form of word frequencies such as TF/IDF and potentially other attributes such as the source, reading level, and length of the article, while d(u_i, d_j) is the distance between these two attribute vectors (profiles) using the similarity measure described above. If the user picks a different set of P articles than was predicted, the user search profile set generation module should try to adjust u and/or d to more accurately predict the articles the user selected. In particular, u_i and/or d_j should be shifted to increase their similarity if user i was predicted not to select article j but did select it, and perhaps also to decrease their similarity if user i was predicted to select article j but did not. A preferred method is to shift u for each wrong prediction that user i will not select article j, using the formula:
   u_{ik} \leftarrow u_{ik} - e\,(u_{ik} - d_{jk})
Here u_i is chosen to be the search profile from user i's search profile set that is closest to the target profile. If e is positive, this adjustment increases the match between user i's search profile set and the target profiles of the articles user i actually selects, by making u_i closer to d_j for the case where the algorithm failed to predict an article that the viewer selected. The size of e determines how many example articles one must see to change the search profile substantially. If e is too large, the algorithm becomes unstable, but for sufficiently small e, it drives u to its correct value. In general, e should be proportional to the measure of article attractiveness; for example, it should be relatively high if user i spends a long time reading article j. One could in theory also use the above formula to decrease the match in the case where the algorithm predicted an article that the user did not read, by making e negative in that case. However, there is no guarantee that u will move in the correct direction in that case. One can also shift the attribute weights w_i of user i by using a similar algorithm:
   w_{ik} \leftarrow \frac{w_{ik} - e\,|u_{ik} - d_{jk}|}{\sum_{k'} \left( w_{ik'} - e\,|u_{ik'} - d_{jk'}| \right)}
This is particularly important if one is combining word frequencies with other attributes. As before, this increases the match if e is positive, for the case where the algorithm failed to predict an article that the user read, this time by decreasing the weights on those characteristics for which the user's target profile u_i differs from the article's profile d_j. Again, the size of e determines how many example articles one must see to replace what was originally believed. Unlike the procedure for adjusting u, one can also make use of the fact that the above algorithm decreases the match if e is negative, for the case where the algorithm predicted an article that the user did not read. The denominator of the expression prevents weights from shrinking to zero over time by renormalizing the modified weights w_i so that they sum to one. Both u and w can be adjusted for each article accessed. When e is small, as it should be, there is no conflict between the two parts of the algorithm. The selected user's search profile set is updated at step 1108.

Further Applications of the Filtering Technology

The news clipping service may deliver news articles (or advertisements and coupons for purchasables) to off-line
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«dkr. to »" •»»<•««»• "".P™" SXTmS' Odor's «nDU>ta«») *• 

ShrtSat may be printed and physically sent to user U es.ec in item, ^ rf means mcluding 

where group G consists of either the ^ .^.^ £ to identify interest may be unused for a p^ticula 

Tset of off-line users who are demographically sunilarto ^ ^ te syste m to present that advert*e 

P fi ! P redicted and the interest of article T to group 25 objecte ) that are ^t^, man y different 

Tis in^bftte maximum interest in articlcTby any of cxampfc ^ t0 profile 6ach investment 

Sesc f hypottetical users; finally, the customized newspa- ^^^J^^ behavior is characterize 1 in the 

STto Sr group G is constructed from those arucles of Tta P „ fi, «™S 

J£»K articks provided by a ^e source^ut °P^»^ g pf ofil ] ng me thod described above ^ 

Second application, termed "broadcast clipping, where dm ^ a stoongiy Q6 t weight for a 

S3 'S tu»4S»««*- ■»*» si <»* o » "* «rl 2£ESX .« ~. 

SSSSS25#= 2£~SSS#4 

or ?£L t«HD or video streaming data) on the ^%^°" 0 g mereby segment me user's portfoho a^ord- 

"etwSL are currently in progress, and employs new. 50 ^ ^ investment type can then be 
similar users which have a similar stock portfolio to that of the user are instead considered similar. Accordingly, owners of stocks which are metrically similar to certain articles are targeted with those articles. By applying similar techniques in this application to those herein described, relevance feedback determines the metric similarity of the associative attributes which is each stock, with the relevant associative attributes which are each article (or their associated textual, descriptive or numeric attributes contained therein). Additionally in this regard, it is also possible to bias the weighting values of users providing relevance feedback to favor those who have invested in similar types of stocks and who have a proven track record of success through their trading decisions. Another application for which this type of pre-adjusted relevance feedback is useful is in recommending and/or automatically trading the most interesting stocks to users using the present methods above described, however, again biasing the relevance feedback to the system by those users who had been most successful in their past trading decisions with regards to those particular types of stocks. Because financial advisors possess varying degrees of skill which varies within different types of investments, such a collaborative filtering based market for investment need not be limited to stocks but to other types of investments as well. The market price for which this "expert advice" is purchased by would-be investors, which have an affinity to investments of the particular types that those advisors are experts in, may be measured using the presently described techniques for determination of price point; thus advice by a given expert for investments which had demonstrated a given level of success may be priced similarly. Additionally, some gross level feedback suggesting the advisor's current awareness about investment types could be automatically assessed by passively observing which articles within which investment domains the user had been recently reading on-line. In accordance with the similarity techniques previously

1. Automatically create a "customized newspaper". User profiling enabling custom recommendations may be achieved by purely passive means of user activity data or, if desired, it can refine and automate the selection process of articles within user selected categories of interest as well as recommend articles within different categories which the user is likely to prefer as evidenced through past behaviors. Applications include:
(a) Presentation of new articles and corresponding advertisements which are of interest to the user.
(b) Recommending (highlighting) these articles from the directory.
2. A customized search engine which offers search results which are tailored and relevancy ranked to user preferences.
3. Using a survey for off-line users for subsequent issues, an inserted card inserted into each issue identifies or prioritizes the most interesting articles/ads.

Notification

An important and novel characteristic of the architecture is the ability to identify new or updated target objects that are relevant to the user, as determined by the user's search profile set or target profile interest summary. (Updated target objects include revised versions of documents and new models of purchasable goods.) The system may notify the user of these relevant target objects by an electronic notification such as an e-mail message or facsimile transmission. In the variation where the system sends an e-mail message, the user's e-mail filter can then respond appropriately to the notification, for instance, by bringing the notification immediately to the user's personal attention, or by automatically submitting an electronic request to purchase the target object named in the notification. A simple example of the latter response is for the e-mail filter to retrieve an on-line document at a nominal or zero charge, or request to buy a purchasable of limited quantity such as a used product or an auctionable.

described, the user may browse between tbe genres of ACTIVE NAVIGATION (BROWSING) 

articles and stocks which are most relevant to one another. Browsing by Navizating Through a Cluster Tree 

Because there are numerous systems and software tools A hierarchical cluster tree imposes a useful organization 

which are used in attempting to predict both selected stocks 40 on a collection of target objects. The tree is of direct use to 

and optimal times to buy or trade them, the current user a user who wishes to browse through all the target objects in 

customization techniques are best implemented as an the tree. Such a user may be exploring the collection with or 

enhancement feature to provide the user with not only without a well-specified goal. The tree's division of target 

quality but also personalization. objects into coherent clusters provides an efficient method 

In the preferred implementation for an on-line newspaper 45 whereby the user can locate a target object of interest. The 

or news filter, each of the above capabilities for customized user first chooses one of the highest level (largest) clusters 

recommendation and notification of investment related from a menu, and is presented with a menu listing the 

articles, stock recommendations and automated stock moni- subclusters of said cluster, whereupon the user may select 

toring and trading features are provided to the user as an one of these subclusters. The system locates the subclusters, 

integrated financial news and investment service. 50 via the appropriate pointer that was stored with the larger 

Additionally, in accordance with the virtual communities cluster, and allows the user to select one of its subclusters 

section below described, users sharing common portfolios from another menu. This process is repeated until the user 

may wish to correspond on-line to advice or experiences comes to a leaf of the tree, which yields the details of an 

with other similar users. Additionally, users who have a past actual target object. Hierarchical trees allow rapid selection 

track record of success may also be particularly identifiable 55 of one target object from a large set. In ten menu selections 

through these virtual communities in conjunction with their from menus of ten items (subclusters) each, one can reach 

participation or their comments and advice relating to spe- 10,000,000,000 (ten billion) items. In the preferred 

cific stocks may be ascribed to those stocks, credentialed as embodiment, the user views the menus on a computer screen 

originating from an expert with a proven track record (and or terminal screen and selects from them with a keyboard or 

made publicly available). 60 mouse. However, tbe user may also make selections over the 

telephone, with a voice synthesizer reading the menus and 

OTHER ON-LINE NEWSPAPER INTERFACE me user selecting subclusters via the telephone's touch-tone 

FEATURES keypad. In another variation, the user simultaneously main- 
tains two connections to the server, a telephone voice 

In accordance with current on-line news interface 65 connection and a fax connection; the server sends successive 

features, several implementation features of the present menus to the user by fax, while the user selects choices via 

system include the following: the telephone's touch-tone keypad. 
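
The menu-driven descent through a cluster tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cluster` class and the functions `navigate` and `reachable_items` are hypothetical names introduced here and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    """A node in a hierarchical cluster tree (hypothetical structure)."""
    label: str
    subclusters: List["Cluster"] = field(default_factory=list)
    target_object: Optional[str] = None  # set only on leaves


def navigate(root: Cluster, choices: List[int]) -> Cluster:
    """Follow a sequence of menu selections (0-based indices) from the root."""
    node = root
    for choice in choices:
        if not node.subclusters:
            break  # reached a leaf, i.e. an actual target object
        node = node.subclusters[choice]
    return node


def reachable_items(menu_size: int, selections: int) -> int:
    """Number of leaves reachable in a uniform tree: menu_size ** selections."""
    return menu_size ** selections


root = Cluster("all", [
    Cluster("sports", [Cluster("article A", target_object="A"),
                       Cluster("article B", target_object="B")]),
    Cluster("humor", [Cluster("article C", target_object="C")]),
])
leaf = navigate(root, [0, 1])  # choose "sports", then its second entry
```

With ten items per menu and ten selections, `reachable_items(10, 10)` gives the ten billion items mentioned in the text.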
Just as user profiles commonly include an associative
attribute indicating the user's degree of interest in each 
target object, it is useful to augment user profiles with an 
additional associative attribute indicating the user's degree 
of interest in each cluster in the hierarchical cluster tree. This
degree of interest may be estimated numerically as the 
number of subclusters or target objects the user has selected 
from menus associated with the given cluster or its 
subclusters, expressed as a proportion of the total number of 
subclusters or target objects the user has selected. This
associative attribute is particularly valuable if the hierarchi- 
cal tree was built using "soft" or "fuzzy" clustering, which 
allows a subcluster or target object to appear in multiple
clusters: if a target document appears in both the "sports" 
and the "humor" clusters, and the user selects it from a menu 
associated with the "humor" cluster, then the system
increases its association between the user and the "humor" 
cluster but not its association between the user and the 
"sports" cluster. 
Labeling Clusters 

Since a user who is navigating the cluster tree is repeatedly
expected to select one of several subclusters from a
menu, these subclusters must be usefully labeled (at step 
503), in such a way as to suggest their content to the human 
user. It is straightforward to include some basic information 
about each subcluster in its label, such as the number of
target objects the subcluster contains (possibly just 1) and
the number of these that have been added or updated 
recently. However, it is also necessary to display additional 
information that indicates the cluster's content. This 
content-descriptive information may be provided by a
human, particularly for large or frequently accessed clusters, 
but it may also be generated automatically. The basic 
automatic technique is simply to display the cluster's "char- 
acteristic value" for each of a few highly weighted attributes. 
With numeric attributes, this may be taken to mean the
cluster's average value for that attribute: thus, if the "year of 
release" attribute is highly weighted in predicting which 
movies a user will like, then it is useful to display average 
year of release as part of each cluster's label. Thus the user 
sees that one cluster consists of movies that were released
around 1962, while another consists of movies from around 
1982. For short textual attributes, such as "title of movie" or 
"title of document," the system can display the attribute's 
value for the cluster member (target object) whose profile is 
most similar to the cluster's profile (the mean profile for all
members of the cluster), for example, the title of the most 
typical movie in the cluster. For longer textual attributes, a 
useful technique is to select those terms for which the 
amount by which the term's average TF/IDF score across 
members of the cluster exceeds the term's average TF/IDF 50 
score across all tar get objects is greatest, either in absolute 
terms or else as a fraction of the standard deviation of the 
term's TF/IDF score across all target objects. The selected 
terms are replaced with their morphological stems, eliminating
duplicates (so that if both "slept" and "sleeping" were
selected, they would be replaced by the single term "sleep") 
and optionally eliminating close synonyms or collocates (so 
that if both "nurse" and "medical" were selected, they might 
both be replaced by a single term such as "nurse,"
"medical," "medicine," or "hospital"). The resulting set of
terms is displayed as part of the label. Finally, if freely 
redistributable thumbnail photographs or other graphical 
images are associated with some of the target objects in the 
cluster for labeling purposes, then the system can display as 
part of the label the image or images whose associated target
objects have target profiles most similar to the cluster 
profile. 
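
The term-selection rule for longer textual attributes can be illustrated with a short Python sketch. It assumes each target object is already represented as a mapping from term to TF/IDF score; the function name `label_terms` and its parameters are hypothetical, and the stemming and synonym-merging steps described above are left out.

```python
from statistics import mean, pstdev
from typing import Dict, List

Scores = Dict[str, float]  # term -> TF/IDF score for one target object


def label_terms(cluster: List[Scores], everything: List[Scores],
                k: int = 3, normalize: bool = False) -> List[str]:
    """Pick the k terms whose cluster-average score most exceeds the global average.

    With normalize=True the excess is expressed as a fraction of the term's
    standard deviation across all target objects.
    """
    vocabulary = {term for doc in everything for term in doc}
    excess: Dict[str, float] = {}
    for term in vocabulary:
        global_scores = [doc.get(term, 0.0) for doc in everything]
        cluster_scores = [doc.get(term, 0.0) for doc in cluster]
        diff = mean(cluster_scores) - mean(global_scores)
        if normalize:
            spread = pstdev(global_scores)
            diff = diff / spread if spread > 0 else 0.0
        excess[term] = diff
    # Highest excess first; ties broken alphabetically for repeatability.
    return sorted(excess, key=lambda term: (-excess[term], term))[:k]


d1 = {"goal": 0.9, "team": 0.5, "the": 0.1}
d2 = {"goal": 0.7, "team": 0.6, "the": 0.1}
d3 = {"joke": 0.8, "the": 0.1}
d4 = {"pun": 0.9, "the": 0.1}
sports_label = label_terms([d1, d2], [d1, d2, d3, d4], k=2)
```

A term such as "the", which scores the same inside and outside the cluster, has zero excess and is never preferred over a genuinely characteristic term.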
Users' navigational patterns may provide some useful
feedback as to the quality of the labels. In particular, if users 
often select a particular cluster to explore, but then quickly 
backtrack and try a different cluster, this may signal that the 
first cluster's label is misleading. Insofar as other terms and 
attributes can provide "next-best" alternative labels for the
first cluster, such "next-best" labels can be automatically 
substituted for the misleading label. In addition, any user can 
locally relabel a cluster for his or her own convenience. 
Although a cluster label provided by a user is in general 
visible only to that user, it is possible to make global use of 
these labels via a "user labels" textual attribute for target 
objects, which attribute is defined for a given target object to 
be the concatenation of all labels provided by any user for
any cluster containing that target object. This attribute
influences similarity judgments: for example, it may induce 
the system to regard target articles in a cluster often labeled 
"Sports News" by users as being mildly similar to articles in 
an otherwise dissimilar cluster often labeled "International 
News" by users, precisely because the "user labels" attribute 
in each cluster profile is strongly associated with the term 
"News." The "user label" attribute is also used in the 
automatic generation of labels, just as other textual attributes 
are, so that if the user-generated labels for a cluster often 
include "Sports," the term "Sports" may be included in the 
automatically generated label as well. 

It is not necessary for menus to be displayed as simple 
lists of labeled options; it is possible to display or print a 
menu in a form that shows in more detail the relation of the 
different menu options to each other. Thus, in a variation, the 
menu options are visually laid out in two dimensions or in 
a perspective drawing of three dimensions. Each option is 
displayed or printed as a textual or graphical label. The 
physical coordinates at which the options are displayed or 
printed are generated by the following sequence of steps: (1) 
construct for each option the cluster profile of the cluster it 
represents, (2) construct from each cluster profile its decom- 
position into a numeric vector, as described above, (3) apply 
singular value decomposition (SVD) to determine the set of 
two or three orthogonal linear axes along which these 
numeric vectors are most greatly differentiated, and (4) take 
the coordinates of each option to be the projected coordi- 
nates of that option's numeric vector along said axes. Step 
(3) may be varied to determine a set of, say, 6 axes, so that 
step (4) lays out the options in a 6-dimensional space; in this 
case the user may view the geometric projection of the 
6-dimensional layout onto any plane passing through the 
origin, and may rotate this viewing plane in order to see 
differing configurations of the options, which emphasize 
similarity with respect to differing attributes in the profiles 
of the associated clusters. In the visual representation, the 
sizes of the cluster labels can be varied according to the 
number of objects contained in the corresponding clusters. 
In a further variation, all options from the parent menu are 
displayed in some number of dimensions, as just described, 
but with the option corresponding to the current menu 
replaced by a more prominent subdisplay of the options on 
the current menu; optionally, the scale of this composite 
display may be gradually increased over time, thereby 
increasing the area of the screen devoted to showing the 
options on the current menu, and giving the visual impres- 
sion that the user is regarding the parent cluster and "zoom- 
ing in" on the current cluster and its subclusters. 
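
Steps (1) through (4) of the layout procedure can be sketched as follows. This is a minimal illustration, not the method of the specification: instead of calling a library SVD routine it finds the leading right singular vectors by power iteration on the Gram matrix, it centers the vectors first (an assumption made here so that the axes capture differences between options), and the name `layout_options` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def layout_options(vectors: List[Vector], dims: int = 2,
                   iterations: int = 200) -> List[Vector]:
    """Project each cluster-profile vector onto its top `dims` principal axes."""
    n, d = len(vectors), len(vectors[0])
    # Assumption: center the vectors before looking for the axes.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # Gram matrix X^T X; its leading eigenvectors are the right singular
    # vectors of the centered data matrix X.
    gram = [[sum(row[a] * row[b] for row in centered) for b in range(d)]
            for a in range(d)]
    axes: List[Vector] = []
    for _ in range(dims):
        axis = [1.0 / (j + 1) for j in range(d)]  # fixed start, for repeatability
        for _ in range(iterations):
            for prev in axes:  # deflation: stay orthogonal to axes already found
                proj = _dot(axis, prev)
                axis = [a - proj * p for a, p in zip(axis, prev)]
            axis = [_dot(row, axis) for row in gram]
            norm = math.sqrt(_dot(axis, axis))
            if norm < 1e-12:
                axis = [0.0] * d  # no variation left in any remaining direction
                break
            axis = [a / norm for a in axis]
        axes.append(axis)
    # Step (4): the coordinates of each option are its projections onto the axes.
    return [[_dot(row, axis) for axis in axes] for row in centered]


profiles = [[0.0, 0.0, 0.0], [10.0, 0.1, 0.0],
            [20.0, -0.1, 0.0], [30.0, 0.0, 0.0]]
coords = layout_options(profiles)
```

Passing `dims=6` would give the six-dimensional layout mentioned above, from which a viewing plane can then be chosen.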
Further Navigational 

It should be appreciated that a hierarchical cluster-tree 
may be configured with multiple cluster selections branch- 
ing from each node or the same labeled clusters presented in 
the form of single branches for multiple nodes ordered in a
hierarchy. In one variation, the user is able to perform lateral 
navigation between neighboring clusters as well, by request- 
ing that the system search for a cluster whose cluster profile 
resembles the cluster profile of the currently selected cluster.
If this type of navigation is performed at the level of 
individual objects (leaf ends), then automatic hyperlinks 
may be then created as navigation occurs. This is one way 
that nearest neighbor clustering navigation may be performed.
For example, in a domain where target objects are
home pages on the World Wide Web, a collection of such 
pages could be laterally linked to create a "virtual mall". 
Most importantly, links to sites in the form of targeted 
advertisements may be temporarily generated (as a result of 
the user profile and the target object profile of the page being
visited, the dialogue being conducted or the content being 
viewed, listened to or read at that moment). This is one way 
in which "on the fly" automatic creation of customized links 
may occur (user specific linking of advertisers with sites or 
other content including programming or joint ads or promotions
between advertisers may occur in real time). Or in
another period this technique may be used to recommend the 
most befitting sites and/or ads which should be linked 
together (based upon their similarity). Of course, certain 
promotions for example may be directly competitive such as
a product for two brands of toothpaste. Such direct com- 
petitive overlap must thus be accounted for. This technique 
may also account for one way or two way (exchanged) links 
between vendors. Advertisers which exchange links or wish 
to link to a "prime location" should pay a price which is
directly in accordance with the market demand for that 
advertisement though not exceeding the price value neces- 
sary to fill the available ad space. The techniques described 
in co-pending patent application entitled "PPS" suggest a
method of automatically generating a customized promotion (or
joint promotion) for individual users. A similar technique
may be used to automatically establish a price for the ad
space (based on a combined predicted price per impression
and predicted value for the average customer expected to
access that advertisement). As feedback occurs, this pricing
model is adjusted according to actual response feedback;
links may be broken or re-formed, in a one way or two way
context, in automatic fashion as such.

The simplest way to use the automatic menuing system 
described above is for the user to begin browsing at the top
of the tree and moving to more specific subclusters. 
However, in a variation, the user optionally provides a query 
consisting of textual and/or other attributes, from which 
query the system constructs a profile in the manner 
described herein, optionally altering textual attributes as
described herein before decomposing them into numeric 
attributes. Query profiles are similar to the search profiles in 
a user's search profile set, except that their attributes are 
explicitly specified by a user, most often for one-time usage, 
and unlike search profiles, they are not automatically
updated to reflect changing interests. A typical query in the 
domain of text articles might have "Tell me about the 
relation between Galileo and the Medici family" as the value 
of its "text of article" attribute, and 8 as the value of its
"reading difficulty" attribute (that is, 8th-grade level). The
system uses the method of section "Searching for Target 
Objects" above to automatically locate a small set of one or 
more clusters with profiles similar to the query profile, for 
example, the articles they contain are written at roughly an 
8th-grade level and tend to mention Galileo and the Medicis.
The user may start browsing at any of these clusters, and can 
move from it to subclusters, superclusters, and other nearby
clusters. For a user who is looking for something in
particular, it is generally less efficient to start at the largest 
cluster and repeatedly select smaller subclusters than it is to 
write a brief description of what one is looking for and then 
to move to nearby clusters if the objects initially recom- 
mended are not precisely those desired. 

Although it is customary in information retrieval systems 
to match a query to a document, an interesting variation is 
possible where a query is matched to an already answered 
question. The relevant domain is a customer service center, 
electronic newsgroup, or Better Business Bureau where 
questions are frequently answered. Each new question- 
answer pair is recorded for future reference as a target 
object, with a textual attribute that specifies the question 
together with the answer provided. As explained earlier with 
reference to document titles, the question should be 
weighted more heavily than the answer when this textual 
attribute is decomposed into TF/IDF scores. A query speci- 
fying "Tell me about the relation between Galileo and the 
Medici family" as the value of this attribute therefore
locates a cluster of similar questions together with their 
answers. In a variation, each question-answer pair may be 
profiled with two separate textual attributes, one for the 
question and one for the answer. A query might then locate 
a cluster by specifying only the question attribute, or for 
completeness, both the question attribute and the (lower- 
weighted) answer attribute, to be the text "Tell me about the 
relation between Galileo and the Medici family." 
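
Matching a query against stored question-answer pairs, with the question weighted more heavily than the answer, might look like the following Python sketch. It is a simplification under stated assumptions: raw term counts stand in for TF/IDF scores, cosine similarity stands in for the profile-similarity measure, and the names `term_weights`, `cosine`, `best_match` and the default `question_weight` of 2.0 are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def term_weights(question: str, answer: str = "",
                 question_weight: float = 2.0) -> Dict[str, float]:
    """Weighted term counts; question terms count more than answer terms."""
    weights: Counter = Counter()
    for word in question.lower().split():
        weights[word] += question_weight
    for word in answer.lower().split():
        weights[word] += 1.0
    return dict(weights)


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query: str, pairs: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the stored (question, answer) pair most similar to the query."""
    query_profile = term_weights(query)
    return max(pairs, key=lambda p: cosine(query_profile, term_weights(p[0], p[1])))


pairs = [
    ("How were Galileo and the Medici family related",
     "The Medici were patrons of Galileo"),
    ("How do I reset my password", "Use the reset link"),
]
match = best_match(
    "Tell me about the relation between Galileo and the Medici family", pairs)
```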

The filtering technology described earlier can also aid the 
user in navigating among the target objects. When the 
system presents the user with a menu of subclusters of a 
cluster C of target objects, it can simultaneously present an 
additional menu of the most interesting target objects in 
cluster C, so that the user has the choice of accessing a 
subcluster or directly accessing one of the target objects. If 
this additional menu lists n target objects, then for each i
between 1 and n inclusive, in increasing order, the i-th most
prominent choice on this additional menu, which choice is
denoted Top(C,i), is found by considering all target objects
in cluster C that are further than a threshold distance t from
all of Top(C,1), Top(C,2), . . . Top(C,i-1), and selecting the
one in which the user's interest is estimated to be highest. If 
the threshold distance t is 0, then the menu resulting from 
this procedure simply displays the n most interesting objects 
in cluster C, but the threshold distance may be increased to 
achieve more variety in the target objects displayed. Generally
the threshold distance t is chosen to be an affine
function or other function of the cluster variance or cluster 
diameter of the cluster C. 
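
The selection of Top(C,1) through Top(C,n) is a greedy procedure and can be sketched directly. In the Python below, `top_choices` is a hypothetical name, and the `interest` and `distance` callables are assumed to be supplied by the rest of the system (the user's estimated interest in an object and the distance between two target profiles).

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_choices(candidates: Sequence[T],
                interest: Callable[[T], float],
                distance: Callable[[T, T], float],
                n: int, t: float) -> List[T]:
    """Greedily pick up to n objects, each further than t from all earlier picks."""
    chosen: List[T] = []
    for _ in range(n):
        eligible = [c for c in candidates
                    if c not in chosen
                    and all(distance(c, p) > t for p in chosen)]
        if not eligible:
            break  # nothing left that is far enough from the earlier choices
        chosen.append(max(eligible, key=interest))
    return chosen


# Toy data: objects on a line, each with an estimated interest score.
position = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}
score = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
names = list(position)


def gap(x: str, y: str) -> float:
    return abs(position[x] - position[y])


plain = top_choices(names, score.get, gap, n=3, t=0.0)
varied = top_choices(names, score.get, gap, n=3, t=1.0)
```

With t = 0 the result is simply the most interesting objects in order; raising t to 1.0 drops "b" and "d", which lie too close to objects already chosen, so the menu is shorter but more varied.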

As a novelty feature, the user U can "masquerade" as 
another user V, such as a prominent intellectual or a celebrity 
supermodel; as long as user U is masquerading as user V, the
filtering technology will recommend articles not according 
to user U's preferences, but rather according to user V's
preferences. Provided that user U has access to the user- 
specific data of user V, for example because user V has 
leased these data to user U for a financial consideration, then 
user U can masquerade as user V by instructing user U's
proxy server S to temporarily substitute user V's user profile
and target profile interest summary for user U's. In a 
variation, user U has access to an average user profile and a
composite target profile interest summary for a group G of 
users; by instructing proxy server S to substitute these for 
user U's user-specific data, user U can masquerade as a 
typical member of group G, as is useful in exploring group 
preferences for sociological, political, or market research. 
More generally, user U may "partially masquerade" as 
another user V or group G, by instructing proxy server S to
temporarily replace user U's user-specific data with a 
weighted average of user U's user-specific data and the 
user-specific data for user V and group G. 
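
A partial masquerade amounts to a weighted average of two sets of user-specific data. The sketch below assumes the data can be represented as a mapping from attribute name to numeric value and that an attribute missing from one side counts as 0; both of these, along with the name `partial_masquerade` and the parameter `weight_other`, are illustrative assumptions.

```python
from typing import Dict

Profile = Dict[str, float]


def partial_masquerade(own: Profile, other: Profile,
                       weight_other: float) -> Profile:
    """Blend two profiles; weight_other=1.0 is a full masquerade, 0.0 is none."""
    if not 0.0 <= weight_other <= 1.0:
        raise ValueError("weight_other must lie between 0 and 1")
    keys = set(own) | set(other)
    return {key: (1.0 - weight_other) * own.get(key, 0.0)
                 + weight_other * other.get(key, 0.0)
            for key in keys}


user_u = {"sports": 1.0, "opera": 0.0}
user_v = {"opera": 1.0, "fashion": 0.5}
blended = partial_masquerade(user_u, user_v, weight_other=0.25)
```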
Menu Organization

Although the topology of a hierarchical cluster tree is 
fixed by the techniques that build the tree, the hierarchical 
menu presented to the user for the user's navigation need not 
be exactly isomorphic to the cluster tree. The menu is 
typically a somewhat modified version of the cluster tree,
reorganized manually or automatically so that the clusters 
most interesting to a user are easily accessible by the user. 
In order to automatically reorganize the menu in a user- 
specific way, the system first attempts automatically to 
identify existing clusters that are of interest to the user. The
system may identify a cluster as interesting because the user 
often accesses target objects in that cluster — or, in a more 
sophisticated variation, because the user is predicted to have 
high interest in the cluster's profile, using the methods 
disclosed herein for estimating interest from relevance
feedback.

Several techniques can then be used to make interesting 
clusters more easily accessible. The system can at the user's 
request or at all times display a special list of the most 
interesting clusters, or the most interesting subclusters of the
current cluster, so that the user can select one of these 
clusters based on its label and jump directly to it. In general, 
when the system constructs a list of interesting clusters in 
this way, the I-th most prominent choice on the list, which
choice is denoted Top(I), is found by considering all appropriate
clusters C that are farther than a threshold distance t
from all of Top(1), Top(2), . . . Top(I-1), and selecting the
one in which the user's interest is estimated to be highest. 
Here the threshold distance t is optionally dependent on the 
computed cluster variance or cluster diameter of the profiles
in the latter cluster. Several techniques that reorganize the 
hierarchical menu tree are also useful. First, menus can be
reorganized so that the most interesting subcluster choices 
appear earliest on the menu, or are visually marked as 
interesting; for example, their labels are displayed in a
special color or type face, or are displayed together with a 
number or graphical image indicating the likely level of 
interest. Second, interesting clusters can be moved to menus 
higher in the tree, i.e., closer to the root of the tree, so that 
they are easier to access if the user starts browsing at the root
of the tree. Third, uninteresting clusters can be moved to 
menus lower in the tree, to make room for interesting 
clusters that are being moved higher. Fourth, clusters with an 
especially low interest score (representing active dislike) can 
simply be suppressed from the menus; thus, a user with
children may assign an extremely negative weight to the 
"vulgarity" attribute in the determination of q, so that vulgar 
clusters and documents will not be available at all. As the 
interesting clusters and the documents in them migrate 
toward the top of the tree, a customized tree develops that
can be more efficiently navigated by the particular user. If 
menus are chosen so that each menu item is chosen with 
approximately equal probability, then the expected number 
of choices the user has to make is minimized. If, for 
example, a user frequently accessed target objects whose
profiles resembled the cluster profile of cluster (a, b, d) in 
FIG. 8 then the menu in FIG. 9 could be modified to show 
the structure illustrated in FIG. 10. 
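
Two of the reorganization techniques, listing the most interesting choices first and suppressing actively disliked ones, can be sketched as below. Moving clusters between levels of the tree is not shown. The function name `reorganize_menu`, the `suppress_below` threshold, and the default interest of 0 for unrated items are assumptions made for illustration.

```python
from typing import Dict, List


def reorganize_menu(items: List[str], interest: Dict[str, float],
                    suppress_below: float = -1.0) -> List[str]:
    """Drop strongly disliked items and list the rest by decreasing interest."""
    visible = [item for item in items
               if interest.get(item, 0.0) > suppress_below]
    # sorted() is stable, so equally interesting items keep their menu order.
    return sorted(visible, key=lambda item: -interest.get(item, 0.0))


menu = ["news", "vulgar", "sports", "humor"]
estimated = {"news": 0.2, "vulgar": -5.0, "sports": 0.9, "humor": 0.2}
personal_menu = reorganize_menu(menu, estimated)
```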

In the variation where the general techniques disclosed 
herein for estimating a user's interest from relevance
feedback are used to identify interesting clusters, it is possible
for a user U to supply "temporary relevance feedback" to 
indicate a temporary interest that is added to his or her usual
interests. This is done by entering a query as described 
above, i.e., a set of textual and other attributes that closely 
match the user's interests of the moment. This query
becomes "active," and affects the system's determination of 
interest in either of two ways. In one approach, an active 
query is treated as if it were any other target object, and by 
virtue of being a query, it is taken to have received relevance 
feedback that indicates especially high interest. In an alter- 
native approach, target objects X whose target profiles are 
similar to an active query's profile are simply considered to 
have higher quality q(U, X), in that q(U, X) is incremented 
by a term that increases with target object X's similarity to 
the query profile. Either strategy affects the usual interest 
estimates: clusters that match user U's usual interests (and 
have high quality q(*)) are still considered to be of interest, 
and clusters whose profiles are similar to an active query are
adjudged to have especially high interest. Clusters that are 
similar to both the query and the user's usual interests are 
most interesting of all. The user may modify or deactivate an 
active query at any time while browsing. In addition, if the 
user discovers a target object or cluster X of particular 
interest while browsing, he or she may replace or augment 
the original (perhaps vague) query profile with the target 
profile of target object or cluster X, thereby amplifying or
refining the original query to indicate a particular interest
in objects similar to X. For example, suppose the user is 
browsing through documents, and specifies an initial query 
containing the word "Lloyd's," so that the system predicts 
documents containing the word "Lloyd's" to be more inter- 
esting and makes them more easily accessible, even to the 
point of listing such documents or clusters of such 
documents, as described above. In particular, certain articles 
about insurance containing the phrase "Lloyd's of London" 
are made more easily accessible, as are certain pieces of 
Welsh fiction containing phrases like "Lloyd's father." The 
user browses while this query is active, and hits upon a 
useful article describing the relation of Lloyd's of London to 
other British insurance houses; by replacing or augmenting 
the query with the full text of this article, the user can turn 
the attention of the system to other documents that resemble 
this article, such as documents about British insurance 
houses, rather than Welsh folk tales. 
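
The second approach to active queries, in which q(U, X) is incremented by a term that grows with the similarity between target object X and the query profile, can be sketched as follows. Cosine similarity over term weights and a linear `boost` factor are assumptions chosen here for concreteness; the specification does not fix either, and `boosted_quality` is a hypothetical name.

```python
import math
from typing import Dict, Optional

Profile = Dict[str, float]  # term -> weight


def cosine(a: Profile, b: Profile) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def boosted_quality(base_quality: float, target: Profile,
                    active_query: Optional[Profile],
                    boost: float = 1.0) -> float:
    """Return q(U, X) plus a bonus that increases with similarity to the query."""
    if active_query is None:
        return base_quality  # no active query: the usual estimate is unchanged
    return base_quality + boost * cosine(target, active_query)


query = {"lloyd's": 1.0}
insurance_article = {"lloyd's": 1.0, "insurance": 1.0}
cooking_article = {"recipe": 1.0}
q_insurance = boosted_quality(0.2, insurance_article, query)
q_cooking = boosted_quality(0.5, cooking_article, query)
```

Replacing `query` with the full term profile of a useful article would shift the bonus toward documents resembling that article, as in the Lloyd's of London example above.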

In a system where queries are used, it is useful to include 
in the target profiles an associative attribute that records the 
associations between a target object and whatever terms are 
employed in queries used to find that target object. The 
association score of target object X with a particular query 
term T is defined to be the mean relevance feedback on 
target object X, averaged over just those accesses of target 
object X that were made while a query containing term T 
was active, multiplied by the negated logarithm of term T's 
global frequency in all queries. The effect of this associative 
attribute is to increase the measured similarity of two 
documents if they are good responses to queries that contain 
the same terms. A further maneuver can be used to improve 
the accuracy of responses to a query: in the summation used 
to determine the quality q(U, X) of a target object X, a term 
is included that is proportional to the sum of association 
scores between target object X and each term in the active 
query, if any, so that target objects that are closely associated 
with terms in an active query are determined to have higher 
quality and therefore higher interest for the user. To comple- 
ment the system's automatic reorganization of the hierar- 
chical cluster tree, the user can be given the ability to 
reorganize the tree manually, as he or she sees fit. Any 
changes are optionally saved on the user's local storage 
device so that they will affect the presentation of the tree in
future sessions. For example, the user can choose to move or
copy menu options to other menus, so that useful clusters
can thereafter be chosen directly from the root menu of the
tree or from other easily accessed or topically appropriate
menus. In another example, the user can select clusters C1,
C2, . . . Ck listed on a particular menu M and choose to
remove these clusters from the menu, replacing them on the
menu with a single aggregate cluster M' containing all the
target objects from clusters C1, C2, . . . Ck. In this case, the
immediate subclusters of new cluster M' are either taken to
be clusters C1, . . . Ck themselves, or else, in a variation
similar to the "scatter-gather" method, are automatically
computed by clustering the set of all the subclusters of
clusters C1, C2, . . . Ck according to the similarity of the
cluster profiles of these subclusters.
Electronic Mall

In one application, the browsing techniques described
above may be applied to a domain where the target objects
are purchasable goods. When shoppers look for goods to
purchase over the Internet or other electronic media, it is
typically necessary to display thousands or tens of thousands
of products in a fashion that helps consumers find the items
they are looking for. The current practice is to use
hand-crafted menus and sub-menus in which similar items are
grouped together. It is possible to use the automated
clustering and browsing methods described above to more
effectively group and present the items. Purchasable items
can be hierarchically clustered using a plurality of different
criteria. Useful attributes for a purchasable item include but
are not limited to a textual description and predefined
category labels (if available), the unit price of the item, and
an associative attribute listing the users who have bought
this item in the past. Also useful is an associative attribute
indicating which other items are often bought on the same
shopping "trip" as this item; items that are often bought on
the same trip will be judged similar with respect to this
Network Context of the Browsing System

The files associated with target objects are typically
distributed across a large number of different servers S1-So
and clients C1-Cn. Each file has been entered into the data
storage medium at some server or client in any one of a
number of ways, including, but not limited to: scanning,
keyboard input, e-mail, FTP transmission, automatic
synthesis from another file under the control of another
computer program. While a system to enable users to efficiently
locate target objects may store its hierarchical cluster tree on
a single centralized machine, greater efficiency can be
achieved if the storage of the hierarchical cluster tree is
distributed across many machines in the network. Each
cluster C, including single-member clusters (target objects),
is digitally represented by a file F, which is multicast to a
topical multicast tree MT(C1); here cluster C1 is either
cluster C itself or some supercluster of cluster C. In this way,
file F is stored at multiple servers, for redundancy. The file
F that represents cluster C contains at least the following
data:
1. The cluster profile for cluster C, or data sufficient to
reconstruct this cluster profile.
2. The number of target objects contained in cluster C.
3. A human-readable label for cluster C, as described in
section "Labeling Clusters" above.
4. If the cluster is divided into subclusters, a list of pointers
to files representing the subclusters. Each pointer is an
ordered pair naming, first, a file, and second, a
multicast tree or a specific server where that file is stored.
5. If the cluster consists of a single target object, a pointer to
the file corresponding to that target object.
The process by which a client machine can retrieve the file
F from the multicast tree MT(C1) is described above in
section "Retrieving Files from a Multicast Tree." Once it has
retrieved file F, the client can perform further tasks
pertaining to this cluster, such as displaying a labeled menu of
subclusters, from which the user may select subclusters for

attribute, so tend to be grouped together. Retailers may be the client to retrieve next. 

interested in utilizing a similar technique for purposes of The advantage of this distributed implementation is three- 
predicting both the nature and relative quantity of items fold. First, the system can be scaled to larger cluster sizes 
which are likely to be popular to their particular clientele. 40 and numbers of target objects, since much more searching 
This prediction may be made by using aggregate purchasing and data retrieval can be carried out concurrently. Second, 
records as the search profile set from which a collection of the system is fault-tolerant in that partial matching can be 
target objects is recommended. Estimated customer demand achieved even if portions of the system are temporarily 
which is indicative of (relative) inventory quantity for each unavailable. It is important to note here the robustness due 
target object item is determined by measuring the cluster 45 to redundancy inherent in our design — data is replicated at 
variance of that item compared to another target object item tree sites so that even if a server is down, the data can be 
(which is in stock). located elsewhere. 

As described above, hierarchically clustering the purchas- The distributed hierarchical cluster tree can be created in 

able target objects results in a hierarchical menu system, in a distributed fashion, that is, with the participation of many 

which the target objects or clusters of target objects that 50 processors. Indeed, in most applications it should be recre- 

appear on each menu can be labeled by names or icons and ated from time to time, because as users interact with target 

displayed in a two-dimensional or mree-dimensional menu objects, the associative attributes in the target profiles of the 

in which similar items are displayed physically near each target objects change to reflect these interactions; the sys- 

other or on the same graphically represented "shelf." As tern's similarity measurements can therefore take these 

described above, this grouping occurs both at the level of 55 interactions into account when judging similarity, which 

specific items (such as standard size Ivory soap or large allows a more perspicuous cluster tree to be built The key 

Breck shampoo) and at the level of classes of items (such as technique is the following procedure for merging n disjoint 

soaps and shampoos). When the user selects a class of items cluster trees, represented respectively by files H ... Fa in 

(for instance, by clicking on it), then the more specific level distributed fashion as described above, into a combined 

of detail is displayed. It is neither necessary nor desirable to 60 cluster tree that contains all the target objects from all these 

limit each item to appearing in one group; customers are trees. The files Fl . . . Fn are described above, except that the 

more likely to find an object if it is in multiple categories. cluster labels are not included in the representation. The 

Non-purchasable objects such as artwork, advertisements, following steps are executed by a server SI, in response to 

and free samples may also be added to a display of pur- a request message from another server SO, which request 

chasable objects, if they are associated with (liked by) 65 message includes pointers to the files Fl ... Fn. 1. Retrieve 

substantially the same users as are the purchasable objects in files Fl . . . Fn. 2. Let L and M be empty lists. 3. For each 

the display. file Fi from among Fl . . . Fn: 4. If file Fi contains pointers 
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to subcluster files, add these pointers to list L. 5. If file Fi 
represents a single target object, add a pointer to file Fi to list 
L. 6. For each pointer X on list L, retrieve the file that pointer 
P points to and extract the cluster profile fl(X) that this file 
stores. 7. Apply a clustering algorithm to group the pointers 
X on list L according to the distances between their respec- 
tive cluster profiles P(X). 8. For each (nonempty) resulting 
group C of pointers: 9. If C contains only one pointer, add 
this pointer to list M; 10. otherwise, if C contains exactly the 
same subclusters pointers as does one of the files Fi from 
among FI . . . Fn, then add a pointer to file Fi to list M; 11. 
otherwise: 12. Select an arbitrary server S2 on the network, 
for example by randomly selecting one of the pointers in 
group C and choosing the server it points to. 13. Send a 
request message to server S2 that includes the subcluster 
pointers in group C and requests server S2 to merge the 
corresponding subcluster trees. 14. Receive a response from 
server S2, containing a pointer to a file G that represents the 
merged tree. Add this pointer to list M. 15. For each file Fi 
from among FI . . . Fn: 16. If list M does not include a 
pointer to file Fi, send a message to the server or servers 
storing Fi instructing them to delete file Fi. 17. Create and 
store a file F that represents a new cluster, whose subclusters 
pointers are exactly the subcluster pointers on list M 18. 
Send a reply message to server SO, which reply message 
contains a pointer to file F and indicates that file F represents 
the merged cluster tree. 
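
The merge procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not part of the specification: a single in-memory dictionary (`store`) stands in for files held on many servers, the request to server S2 is modelled as a recursive call, the clustering algorithm is a simple leader clustering with a hypothetical `radius` parameter, and cluster profiles are plain numeric tuples averaged by cluster size. The names `ClusterFile`, `group`, `new_file` and `merge` are invented for illustration. Two departures from the literal steps are also assumptions: the sketch stops subdividing when every pointer falls into one group (so that the recursion terminates), and it deletes only superseded cluster files, never files for single target objects.

```python
import itertools
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Profile = Tuple[float, ...]


@dataclass
class ClusterFile:
    profile: Profile                                    # cluster profile
    size: int = 1                                       # number of target objects
    children: List[str] = field(default_factory=list)  # subcluster pointers


Store = Dict[str, ClusterFile]  # assumed stand-in for files spread over servers
_ids = itertools.count(1)


def group(store: Store, ptrs: List[str], radius: float) -> List[List[str]]:
    """Step 7: join the first group whose leader lies within `radius`."""
    groups: List[List[str]] = []
    for p in ptrs:
        for g in groups:
            if math.dist(store[p].profile, store[g[0]].profile) <= radius:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups


def new_file(store: Store, children: List[str]) -> str:
    """Step 17: store a file whose subcluster pointers are `children`."""
    size = sum(store[c].size for c in children)
    dims = len(store[children[0]].profile)
    profile = tuple(
        sum(store[c].profile[i] * store[c].size for c in children) / size
        for i in range(dims)
    )
    name = f"G{next(_ids)}"
    store[name] = ClusterFile(profile, size, list(children))
    return name


def merge(store: Store, roots: List[str], radius: float) -> str:
    """Steps 1-18; the request to server S2 is a recursive call here."""
    L: List[str] = []
    for r in roots:                          # steps 3-5
        L.extend(store[r].children or [r])
    groups = group(store, L, radius)         # steps 6-7
    if len(groups) == 1:                     # assumed guard: stop subdividing
        groups = [[p] for p in L]
    M: List[str] = []
    for g in groups:                         # steps 8-14
        reuse = next((r for r in roots if store[r].children
                      and set(store[r].children) == set(g)), None)
        if len(g) == 1:
            M.append(g[0])
        elif reuse is not None:
            M.append(reuse)
        else:
            M.append(merge(store, g, radius / 2))
    for r in roots:                          # steps 15-16
        if store[r].children and r not in M:
            del store[r]
    return new_file(store, M)                # steps 17-18
```

Calling `merge(store, ["T1", "T2"], radius=3.0)` on two small trees returns the name of the file for the combined tree; halving the radius on each recursive call is one arbitrary way to obtain finer clusters at deeper levels.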

With the help of the above procedure, and the multicast tree MT_full
that includes all proxy servers in the network, the distributed
hierarchical cluster tree for a particular domain of target objects
is constructed by merging many local hierarchical cluster trees, as
follows.

1. One server S (preferably one with good connectivity) is elected
from the tree.

2. Server S sends itself a global request message that causes each
proxy server in MT_full (that is, each proxy server in the network)
to ask its clients for files for the cluster tree.

3. The clients of each proxy server transmit to the proxy server any
files that they maintain, which files represent target objects from
the appropriate domain that should be added to the cluster tree.

4. Server S forms a request R1 that, upon receipt, will cause the
recipient server S1 to take the following actions:
(a) Build a hierarchical cluster tree of all the files stored on
server S1 that are maintained by users in the user base of S1. These
files correspond to target objects from the appropriate domain. This
cluster tree is typically stored entirely on S1, but may in principle
be stored in a distributed fashion.
(b) Wait until all servers to which the server S1 has propagated
request R1 have sent the recipient reply messages containing pointers
to cluster trees.
(c) Merge together the cluster tree created in step 4(a) and the
cluster trees supplied in step 4(b), by sending any server (such as
S1 itself) a message requesting such a merge, as described above.
(d) Upon receiving a reply to the message sent in (c), which reply
includes a pointer to a file representing the merged cluster tree,
forward this reply to the sender of request R1, unless this is S1
itself.

5. Server S sends itself a global request message that causes all
servers in MT_full to act on embedded request R1.

6. Server S receives a reply to the message it sent in 4(c). This
reply includes a pointer to a file F that represents the completed
hierarchical cluster tree. Server S multicasts file F to all proxy
servers in MT_full.

Once the hierarchical cluster tree has been created as above, server
S can send additional messages through the cluster tree, to arrange
that multicast trees MT(C) are created for sufficiently large
clusters C, and that each file F is multicast to the tree MT(C),
where C is the smallest cluster containing file F.

VIRTUAL COMMUNITIES AND THE VIRTUAL
ORGANIZATION
Matching users for Virtual Communities on the Internet 
Computer users frequently join other users for discussions on
computer bulletin boards, newsgroups, mailing lists, and real-time
chat sessions over the computer network, which may be typed (as with
Internet Relay Chat (IRC)), spoken (as with Internet phone), or
videoconferenced. These forums are herein termed "virtual
communities." In current practice, each virtual community has a
specified topic, and users discover communities of interest by word
of mouth or by examining a long list of communities (typically
hundreds or thousands). The users then must decide for themselves
which of thousands of messages they find interesting from among those
posted to the selected virtual communities, that is, made publicly
available to members of those communities. If they desire, they may
also write additional messages and post them to the virtual
communities of their choice. The existence of thousands of Internet
bulletin boards (also termed newsgroups) and countless more Internet
mailing lists and private bulletin board services (BBS's)
demonstrates the very strong interest among members of the electronic
community in forums for the discussion of ideas about almost any
subject imaginable. Presently, virtual community creation proceeds in
a haphazard form, usually instigated by a single individual who
decides that a topic is worthy of discussion. There are protocols on
the Internet for voting to determine whether a newsgroup should be
created, but there is a large hierarchy of newsgroups (which begin
with the prefix "alt.") that do not follow this protocol.

The system for customized electronic identification of desirable
objects described herein can of course function as a browser for
bulletin boards, where target objects are taken to be bulletin
boards, or subtopics of bulletin boards, and each target profile is
the cluster profile for a cluster of documents posted on some
bulletin board. Thus, a user can locate bulletin boards of interest
by all the navigational techniques described above, including
browsing and querying. However, this method only serves to locate
existing virtual communities. Because people have varied and varying
complex interests, it is desirable to automatically locate groups of
people with common interests in order to form virtual communities.
The Virtual Community Service (VCS) described below is a
network-based agent that seeks out users of a network with common
interests, dynamically creates bulletin boards or electronic mailing
lists for those users, and introduces them to each other
electronically via e-mail. It is useful to note that once virtual
communities have been created by VCS, the other browsing and
filtering technologies described above can subsequently be used to
help a user locate particular virtual communities (whether
pre-existing or automatically generated by VCS); similarly, since the
messages sent to a given virtual community may vary in interest and
urgency for a user who has joined that community, these browsing and
filtering technologies (such as the e-mail filter) can also be used
to alert the user to urgent messages and to screen out uninteresting
ones.

The functions of the Virtual Community Service are general functions
that could be implemented on any network ranging from an office
network in a small company to the World Wide Web or the Internet. The
four main steps in the procedure are:

1. Scan postings to existing virtual communities.

2. Identify groups of users with common interests.

3. Match users with virtual communities, creating new virtual
communities when necessary.

4. Continue to enroll additional users in the existing virtual
communities.

More generally, users may post messages to virtual communities
pseudonymously, even employing different pseudonyms for different
virtual communities. (Posts not employing a pseudonymous mix path
may, as usual, be considered to be posts employing a non-secure
pseudonym, namely the user's true network address.) Therefore, the
above steps may be expressed more generally as follows:

1. Scan pseudonymous postings to existing virtual communities.

2. Identify groups of pseudonyms whose associated users have common
interests.

3. Match pseudonymous users with virtual communities, creating new
virtual communities when necessary.

4. Continue to enroll additional pseudonymous users in the existing
virtual communities.

Each of these steps can be carried out as described below.
Virtual Organization 

E-mail Groupware on the Intranet (Intranet applications) 

Another application of Virtual Communities is to virtual
organizations. Organizations may use the above described techniques
in accordance with their unique circumstances of intranet-enabled
communications involving telephony, voice and video conferencing,
voice mail, groupware and e-mail. The applications described below
enable users to communicate better and to route messages, either by
matching users with each other or by filtering e-mail or voice
messages. They draw on the previously described technologies,
including the matching of users in virtual communities on the
Internet and the techniques described in the previous sections.
E-mail Filter 

In addition to the news clipping service described above, the system
for customized electronic identification of desirable objects
functions in an e-mail environment in a similar but slightly
different manner. The news clipping service selects and retrieves
news information that would not otherwise reach its subscribers. But
at the same time, large numbers of e-mail messages do reach users,
having been generated and sent by humans or automatic programs. These
users need an e-mail filter, which automatically processes the
messages received. The necessary processing includes a determination
of the action to be taken with each message, including, but not
limited to: filing the message, notifying the user of receipt of a
high priority message, and automatically responding to a message. The
e-mail filter system must not require too great an investment on the
part of the user to learn and use, and the user must have confidence
in the appropriateness of the actions automatically taken by the
system. The same filter may be applied to voice mail messages or
facsimile messages that have been converted into electronically
stored text, whether automatically or at the user's request, via the
use of well-known techniques for speech recognition or optical
character recognition.

The filtering problem can be defined as follows: a message processing
function MPF(*) maps from a received message (document) to one or
more of a set of actions. The actions, which may be quite specific,
may be either predefined or customized by the user. Each action A has
an appropriateness function F_A(*,*) such that F_A(U,D) returns a
real number, representing the appropriateness of selecting action A
on behalf of user U when user U is in receipt of message D. For
example, if D comes from a credible source and is marked urgent, then
discarding the message has a high cost to the user and has low
appropriateness, so that F_discard(U,D) is small, whereas alerting
the user of receipt of the message is highly appropriate, so that
F_alert(U,D) is large. Given the determined appropriateness function,
the function MPF(D) is used to automatically select the appropriate
action or actions. As an example, the following set of actions might
be useful:

1. Urgently notify user of receipt of message and/or insert 
message higher in the queue indicating its priority. 



2. Insert message into queue for user to read later 

3. Insert message into queue for user to read later, and 
suggest that user reply 

4. Insert message into queue for user to read later, and
suggest that user forward it to individual R, where
individual R's profile indicates that the message is
relevant to him/her, or suggest that the message be sent
as a voice mail using text to speech or as a fax or e-mail.
The message may also be in the form of voice mail or
voice e-mail.

5. Summarize message and insert summary into queue 

6. Forward message to user's secretary 

7. File message in directory X 

8. File message in directory Y 

9. Delete message (i.e., ignore message and do not save) 
and/or 

10. Notify sender that further messages on this subject are 
unwanted 

11. Provide a form auto request response that the sender 
of the e-mail (or voice mail) message will be ignored 
(and that it will be deleted). 

12. Send a form auto response to the sender of an e-mail 
message that the user is out of town where the identity 
(or user profile) determines the selection of the 
response message. 

13. Send a form auto response message to an individual
to whom the user does not want to reply directly.

14. Similarly provide an auto response voice mail mes- 
sage that is specific (or most relevant) to the identity of 
the caller. 

15. Suggest to the user to authorize a form auto request for 
deletion from a mailing list. Provide an automatic call 
screening function. 

16. Provide an automatic call screening function that,
depending on the caller's identity, determines whether
to allow the call to pass through to the secretary or user
or to prompt the caller to indicate the nature/purpose of
his/her call, using a speech to text conversion module to
automatically select the most appropriate auto response
message. Based upon the identity of the caller and/or
the stated objectives of the call, the system determines
whether to forward the call to the user's secretary,
forward the call directly to the user, automatically page
the user, request that the caller not call back, or
automatically forward the call to another user whose
profile is more relevant. In this scenario, if the user so
desires, when the call is forwarded directly to the user,
when the user is paged while the caller is holding, or
when the system forwards the call to the user's voice
mail, the user may identify the caller and/or listen to
his/her stated objective of the call, or the system may
automatically inform the caller, based upon his/her
identity and/or stated calling objective, not to call back
(where the voice mail option is not provided).

17. Periodically remind the user that message "x" requests
and warrants a reply due to its urgency.

18. Automatically recommend to the user a mailing list of 
the most appropriate prospective recipients of a given 
outbound e-mail message. This list is determined by 
both the user's previous e-mail activities regarding 
those prospective recipients and their user profiles as 
well.
19. Accordingly suggest to the user a mailing list to
automatically forward incoming e-mail messages
which have been received wherein the user is not the
most appropriate recipient for that message (if
appropriate the forwarding party may also view the profile of
the recommended recipient(s) prior to approving the
recommendation). This system may also be used as an
e-mail router for incoming e-mail or voice mail coming
into an organization, which occurs automatically or
upon a human's approval.
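
As a rough illustration of the message processing function MPF defined before the list of actions, the following Python sketch scores each candidate action with its appropriateness function F_A(U,D) and picks the highest-scoring one. It is a sketch under assumptions, not the specification's method: the two scoring functions `f_alert` and `f_discard`, their numeric values, the message fields `urgent` and `credible`, and the `threshold` used to decide when to ask the user for confirmation are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Message = Dict[str, object]
Appropriateness = Callable[[str, Message], float]  # F_A(U, D)


def f_alert(user: str, doc: Message) -> float:
    """Hypothetical F_alert: large for urgent mail from a credible source."""
    return 0.9 if doc.get("urgent") and doc.get("credible") else 0.2


def f_discard(user: str, doc: Message) -> float:
    """Hypothetical F_discard: small for urgent mail from a credible source."""
    return 0.05 if doc.get("urgent") and doc.get("credible") else 0.4


def mpf(user: str, doc: Message,
        functions: Dict[str, Appropriateness],
        threshold: float) -> Tuple[str, bool]:
    """Return the most appropriate action and whether to ask for confirmation."""
    scores = {action: f(user, doc) for action, f in functions.items()}
    best = max(scores, key=scores.get)
    # Assumed policy: ask the user when even the best action scores low.
    return best, scores[best] < threshold
```

With `{"alert": f_alert, "discard": f_discard}` as the action table, an urgent message from a credible source selects the alert action without confirmation, while a message for which no action scores above the threshold is referred back to the user.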
The above appropriateness functions may of course instead first be
manually entered as if-then rules, which are techniques well known in
the art. Additionally, the [illegible] may instead be automatically
written, in which case the user may approve or rewrite a recommended
appropriateness [illegible] the message exceeds value X perform
appropriate[illegible]. Additional applications of the present
methods are conceivable. For example, in the case of sending,
forwarding (or regarding) message to users based upon appropriateness
functions relating to the profiles of the message and prospective
recipients, it is possible to use this technique to allow [illegible]
to more efficiently submit queries for response [illegible] within
any intranet, an (extranet) or the Internet. An example application
of the scenario is as follows:

1. A newbie submits a query by web or e-mail.

2. The engine shows the user a few answers such that
similar newbie, query, answer triples have been highly
rated. (One kind of answer consists of nothing but the
URL of a helpful site!)

3. If the user finds these answers unsatisfactory, the engine
takes note of this feedback. Then it goes to plan B, and
finds a few experts such that the newbie, query, expert,
time-of-day tuples have been highly rated.

4. The system offers all of these experts the question by
[illegible]

5. First expert to indicate interest in the offer (by replying
"yes") gets a go-ahead from the system.

6. Expert replies by sending an answer from the system.
S/he may reply if further dialogue is [illegible]. The
conversation can continue in this way indefinitely. Of
course it all goes through the system, so it's all
pseudonymous and logged. (Sometimes the correspondence
may go off-topic. There should be a mechanism
for dealing with this, so that rambling (or personal)
discussion won't appear in its entirety as part of the
database. E.g., if I want to go off-topic with my next
[illegible]. The system then forwards the message
as usual, but with my real return address as the Reply-to
field. Further correspondence (if the other correspondent
chooses to reply) then occurs with real names and
outside the system.)

8. The newbie rates the quality of the dialogue as a
precondition for being allowed to ask more questions.
(The expert is allowed to rate it too, so that the system
knows which questions the expert LIKES to answer,
not just which ones s/he WILL answer.)

9. If the dialogue never took place, because some expert
replied in step 5 but didn't continue to step 6 within a
reasonable time, the system sends a go-ahead to the
next most appropriate of the experts who indicated
interest in step 5. It also does this if the newbie got an
answer but said (in step 8) that it was unhelpful. (In the
latter case the system might allow the newbie to edit
the question. The edited query would be included in
the go-ahead to the next expert.)

10. If in step 5 or step 9 none of the experts
have indicated interest, within a reasonable time after
the question was originally posed, then the system
slowly offers the question to more experts (as in Step
4) up to a reasonable limit, until it does get a bite.

11. Any expert who received a request but ignored it gets
a relevance feedback value of 0 for that query. Any
expert who gave a go-ahead, but didn't get to answer,
[illegible]
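
The offer and go-ahead logic of steps 4, 5, 9, 10 and 11 can be sketched as follows. This is a hedged illustration only: the function name `dispatch`, the `batch` and `limit` parameters, and the two callbacks `interested` and `answers` (which stand in for an expert replying "yes" and then actually answering) are assumptions made for the example, and experts are assumed to express interest in rank order.

```python
from typing import Callable, Dict, List, Optional, Tuple


def dispatch(ranked: List[str],
             interested: Callable[[str], bool],
             answers: Callable[[str], bool],
             batch: int = 2,
             limit: int = 6) -> Tuple[Optional[str], Dict[str, int]]:
    """Offer a question to experts batch by batch until one answers it."""
    feedback: Dict[str, int] = {}
    pool = ranked[:limit]                      # step 10: reasonable limit
    for start in range(0, len(pool), batch):
        offered = pool[start:start + batch]    # steps 4 and 10: offer question
        willing = [e for e in offered if interested(e)]   # step 5
        for e in offered:
            if e not in willing:
                feedback[e] = 0                # step 11: ignored the request
        for e in willing:                      # step 9: next expert on timeout
            if answers(e):
                return e, feedback
    return None, feedback
```

The function returns the expert who ended up answering (or `None` if the limit was reached) together with the zero-valued relevance feedback recorded for experts who ignored the offer.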
For choosing an expert, some interesting attributes of
the expert are usual time to respond, length of response,
and count of technical terms in response, since different
users may have different sensitivities to these factors.
Also the text of queries/list of [illegible]
answered, what clusters of newbies have rated them
highly, etc. Finally, the set of terms in their explicit
declarations of interest, and in their resumes: this
helps cluster them both with queries and with other
experts. If we had a billing mechanism (which would
probably require collaboration with AOL or someone,
since it's currently hard to collect from a user who only
spends $1/month on queries), here would be a rough
pricing model: When a question lands on your desk, it
comes accompanied by an offer of payment. So the
system looks for an expert, price pair such that newbie,
query, expert, price, time-of-day is highly rated, meaning
[illegible] expert is likely to answer this question for this price
[illegible] will be satisfied with the tradeoff between
[illegible] getting offered prices to
[illegible] correctly. It does mean that it's hard to lower your
rates (in a particular area) once the system has decided
you're expensive and stopped sending you queries, but there
are ways around this (e.g., you could always actively notify
the system of your new approximate rates, either out of the
blue or when responding to a request; additionally, the system
might e-mail inactive experts every so often, asking if they
want to lower their rates, declare additional interests, be
dropped from the rolls, etc.). There is also a [illegible]
model, which is presumably the best way to start. It might
involve some or all of these elements:

Get nice idealists to participate as experts by advertising
on Usenet (and/or by actually seeding the database with
Usenet postings from selected groups, so that people
may be experts without knowing it). I think there are
some people who would participate freely given that
only a few people have to see each question, so they
won't get many; it would reduce Usenet traffic, where
everyone has to see all the questions; the answer
would be permanently on file, and they could sign it
(good for visibility!);
if they ignore the questions they'll just go away.
Attract advertising.
The benign kind of advertising: plugs in signs and on
[illegible]
The sleazy kind: A query about word processors or
WordPerfect is highly likely to draw an on-file
"expert" response touting Microsoft Word.
The semi-sleazy kind: the expert responses to the query
are uncompromised (they're genuinely highly rated)
but a [illegible] advert labeled as such is attached.
(Apparently IBM bought the queries "Microsoft"
and "gates" on Lycos!)
Use play money. By answering questions, you can build document sender, date sent, document length, date of last
up credit that you can use to ask questions. However, document received from this sender, key words, list of other 
if you go too deeply into debt, you have to fork over addressees, etc. It was disclosed above how to estimate an 
real money (or accept advertising). If you go well into interest function on profiled target objects, using relevance 
profit, you can cash in. One could imagine eventually 5 feedback together with measured similarities among target 
using this system as the seed of a VCS chat service, objects and among users. In the context of the e-mail filter,
where queries consisted of topics advertisements. the task is to estimate several appropriateness functions F A 
(We'd just have to allow new people to get added to (*,*)» one per action. This is handled with exactly the same 
existing conversations.) It's also a good way for method as was used earlier to estimate the topical interest 
consultants, brokers, mechanics, etc., to advertise their 10 function £(*,*). Relevance feedback in this case is provided 
expertise (remember that answers can be paid for by the user's observed actions over time: whenever user U 
on-line, or a negotiation taken off-line). And for the chooses action A on document D, either freely or by choos- 
same reason, I could all-too-easily imagine it replacing ing or confirming an action recommended by the system,
1-900 phone sex numbers. (Hey-ratings, price and all!) this is taken to mean that the appropriateness of action A on 
This matching criteria includes interest attributes. 15 document D is high, particularly if the user takes this action 
Though this market model is useful for the above A immediately after seeing document D. A presumption of 
example it is readily applicable to any of the afore- no appropriateness (corresponding to the earlier presump- 
mentioned applications (to retrieving information, tion of no interest) is used so that action A is considered 
human experts, employers and employees, buyers and inappropriate on a document unless the user or similar users 
sellers, and may be applied likewise to any product, 20 have taken action A on this document or similar documents, 
commodity, share or interest that may be exchanged in In particular, if no similar document has been seen, no action 
an open market, e.g., stocks, commodities, insurance is considered especially appropriate, and the e-mail filter 
policies, products (bought and sold or bartered). asks the user to specify the appropriate action or confirm that 
Domains of application for the Internet-wide market the action chosen by the e-mail filter is the appropriate one. 
system (such as legal counseling, medicine, 25 Thus, the e-mail filter learns to take particular actions on 
engineering, psychological/sociological services, computer e-mail messages that have certain attributes or combinations
solutions) as well as more subjective domains of attributes. For example, messages from John Doe
such as architectural design, product design, document that originate in the (212) area code may prompt the system 
authoring, landscaping, decor (personalized fashion to forward a copy by fax transmission to a given fax number, 
design) and cosmetics as well as informal solutions to 30 or to file the message in directory X on the user's client 
problems of individuals based on their unique life and processor. A variation allows active requests of this form 
professional experiences, and encounters. Additionally, from the user, such as a request that any message from John 
some experts may choose to use a filtering functionality Doe be forwarded to a desired fax number until further 
on their system with preset parameters such as the price notice. This active user input requires the use of a natural 
of a given task must meet a preset minimum to qualify. 35 language or form-based interface for which specific corn- 
Notice that actions 8 and 9 in the sample list above are mands are associated with particular attributes and combi- 

designed to filter out messages that are undesirable to the nations of attributes. 

user or that are received from undesirable sources, such as Scanning 

pesky salespersons, by deleting the unwanted message and/ Using the technology described above, Virtual Commu- 

or sending a reply that indicates that messages of this type 40 nity Service constantly scans all the messages posted to all 

will not be read. The appropriateness functions must be the newsgroups and electronic mailing lists on a given 

tailored to describe the appropriateness of carrying out each network, and constructs a target profile for each message 

action given the target profile for a particular document, and found. The network can be the Internet, or a set of bulletin 

then a message processing function MPF can be found boards maintained by America Online, Prodigy, or 

which is in some sense optimal with respect to the appro- 45 CompuServe, or a smaller set of bulletin boards that might 

priateness function. One reasonable choice of MPF always be local to a single organization, for example a large 

picks the action with highest appropriateness, an d in cases company, a law firm, or a university. The scanning activity 

where multiple actions are highly appropriate and are also need not be confined to bulletin boards and mailing lists that 

compatible with each other, selects more than one action: for were created by Virtual Community Service, but may also be 

example, it may automatically reply to a message and also 50 used to scan the activity of communities that predate Virtual 

file the same message in directory X, so that the value of Community Service or are otherwise created by means 

MPF(D) is the set\{reply, file in directory X\}. In cases outside the Virtual Community Service system, provided 

where the appropriateness of even the most appropriate that these communities are public or otherwise grant their 

action falls below a user-specified threshold, as should permission. 

happen for messages of an unfamiliar type, the system asks 55 The target profile of each message includes textual 

the user for confirmation of the action(s) selected by MPF. attributes specifying the title and body text of the message. 

In addition, in cases where MPF selects one action over In the case of a spoken rather than written message, the latter 

another action that is nearly as appropriate, the system also attribute may b e computed from the acoustic speech data by 

asks the user for confirmation: for example, mail should not using a speech recognition system. The target profile also 

be deleted if it is nearly as appropriate to let the user see it. 60 includes an associative attribute listing the authors) and 

It is possible to write appropriateness functions manually, designated recipient(s) of the message, where the recipients 

but the time necessary and lack of user expertise render this may be individuals and/or entire virtual communities; if this 

solution impractical. The automatic training of this system is attribute is highly weighted, then the system tends to regard 

preferable, using the automatic user profiling system messages among the same set of people as being similar or 

described above. Each received document is viewed as a 65 related, even if the topical similarity of the messages is not 

target object whose profile includes such attributes as the clear from their content, as may happen when some of the 

entire text of the document (represented as TF/IDF scores), messages are very short. Other important attributes include 
the fraction of the message that consists of quoted material
from previous messages, as well as attributes that are 
generally useful in characterizing documents, such as the 
message's date, length, and reading level. 
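
As an illustrative sketch only, the message attributes listed above could be gathered into one record as follows. The names `MessageProfile` and `profile_message`, the convention that quoted lines begin with ">", and the use of raw word counts in place of TF/IDF scores are assumptions made for this example; reading level is omitted.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class MessageProfile:
    title_terms: Counter      # textual attribute: word counts from the title
    body_terms: Counter       # textual attribute: word counts from the body
    participants: frozenset   # associative attribute: author(s) and recipient(s)
    quoted_fraction: float    # share of non-blank body lines that are quoted
    sent: date
    length: int               # body length in whitespace-separated words


def profile_message(title, body, authors, recipients, sent):
    """Build a hypothetical target profile for one scanned message."""
    lines = [ln for ln in body.splitlines() if ln.strip()]
    # Assumption: quoted material is marked with a leading ">".
    quoted = sum(1 for ln in lines if ln.lstrip().startswith(">"))
    words = body.lower().split()
    return MessageProfile(
        title_terms=Counter(title.lower().split()),
        body_terms=Counter(words),
        participants=frozenset(authors) | frozenset(recipients),
        quoted_fraction=quoted / len(lines) if lines else 0.0,
        sent=sent,
        length=len(words),
    )
```

Recipients may be individual pseudonyms or the names of whole virtual communities; both are simply members of the `participants` set here.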
Virtual Community Identification 

Next, Virtual Community Service attempts to identify 
groups of pseudonymous users with common interests. 
These groups, herein termed "pre-communities," are repre- 
sented as sets of pseudonyms. Whenever Virtual Community 
Service identifies a pre-community, it will subsequently
attempt to put the users in said pre-community in contact
with each other, as described below. Each pre-community is
said to be "determined" by a cluster of messages, pseud- 
onymous users, search profiles, or target objects. 

In the usual method for determining pre-communities, 
Virtual Community Service clusters the messages that were 
scanned and profiled in the above step, based on the simi- 
larity of those messages' computed target profiles, thus 
automatically finding threads of discussion that show com- 
mon interests among the users. Naturally, discussions in a 
single virtual community tend to show common interests; 
however, this method uses all the texts from every available 
virtual community, including bulletin boards and electronic 
mailing lists. Indeed, a user who wishes to initiate or join a 
discussion on some topic may send a "feeler message" on 
that topic to a special mailing list designated for feeler
messages; as a consequence of the scanning procedure described
above, the feeler message is automatically grouped with any 
similarly profiled messages that have been sent to this 
special mailing list, to topical mailing lists, or to topical 
bulletin boards. The clustering step employs "soft 
clustering," in which a message may belong to multiple
clusters and hence to multiple virtual communities. Each 
cluster of messages that is found by Virtual Community 
Service and that is of sufficient size (for example, 10-20 
different messages) determines a pre-community whose 
members are the pseudonymous authors and recipients of 
the messages in the cluster. More precisely, the pre- 
community consists of the various pseudonyms under which 
the messages in the cluster were sent and received. 
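
A minimal sketch of how clusters of messages might determine pre-communities is given below. It assumes that cluster centroids have already been computed, that profiles are sparse dictionaries of term weights, and that similarity is measured by cosine; the function names and the `min_sim` and `min_size` parameters are hypothetical and are not taken from the text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pre_communities(messages, centroids, min_sim=0.5, min_size=10):
    """messages: list of (profile, pseudonyms); centroids: list of profiles.

    Soft assignment: a message joins every cluster whose centroid is at
    least min_sim similar, so it may belong to several clusters.
    """
    clusters = [[] for _ in centroids]
    for profile, pseudonyms in messages:
        for i, centroid in enumerate(centroids):
            if cosine(profile, centroid) >= min_sim:
                clusters[i].append(pseudonyms)
    # Only clusters of sufficient size determine a pre-community, which
    # is the set of pseudonyms that sent or received the messages.
    return [
        set().union(*members)
        for members in clusters
        if len(members) >= min_size
    ]
```

Because assignment is thresholded rather than exclusive, a message that resembles two centroids contributes its pseudonyms to both pre-communities.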

Alternative methods for determining a pre-community, 
which do not require the scanning step above, include the 
following: 1. Pre-communities can be generated by grouping 
together users who have similar interests of any sort, not 
merely individuals who have already written or received
messages about similar topics. If the user profile associated 
with each pseudonym indicates the user's interests, for 
example through an associative attribute that indicates the 
documents or Web sites a user likes, then pseudonyms can 
be clustered based on the similarity of their associated user 
profiles, and each of the resulting clusters of pseudonyms 
determines a pre-community comprising the pseudonyms in 
the cluster. 2. If each pseudonym has an associated search 
profile set formed through participation in the news clipping 
service described above, then all search profiles of all 
pseudonymous users can be clustered based on their 
similarity, and each cluster of search profiles determines a 
pre-community whose members are the pseudonyms from 
whose search profile sets the search profiles in the cluster are 
drawn. Such groups of people have been reading about the 
same topic (or, more generally, accessing similar target 
objects) and so presumably share an interest. 3. If users 
participate in a news clipping service or any other filtering 
or browsing system for target objects, then an individual 
user can pseudonymously request the formation of a virtual 
community to discuss a particular cluster of one or more 
target objects known to that system. This cluster of target 
objects determines a pre-community consisting of the pseud-
onyms of users determined to be most interested in that
cluster (for example, users who have search profiles similar
to the cluster profile), together with the pseudonym of the
user who requested formation of the virtual community.
Matching Users with Communities 

Once Virtual Community Service identifies a cluster C of
messages, users, search profiles, or target objects that deter-
mines a pre-community M, it attempts to arrange for the
members of this pre-community to have the chance to
participate in a common virtual community V. In many
cases, an existing virtual community V may suit the needs of
the pre-community M. Virtual Community Service first
attempts to find such an existing community V. In the case
where cluster C is a cluster of messages, V may be chosen
to be any existing virtual community such that the cluster
profile of cluster C is within a threshold distance of the mean
profile of the set of messages recently posted to virtual
community V; in the case where cluster C is a cluster of
users, V may be chosen to be any existing virtual community
such that the cluster profile of cluster C is within a threshold
distance of the mean user profile of the active members of
virtual community V; in the case where the cluster C is a
cluster of search profiles, V may be chosen to be any existing
virtual community such that the cluster profile of cluster C
is within a threshold distance of the cluster profile of the
largest cluster resulting from clustering all the search pro-
files of active members of virtual community V; and in the
case where the cluster C is a cluster of one or more target
objects chosen from a separate browsing or filtering system,
V may be chosen to be any existing virtual community
initiated in the same way from a cluster whose cluster profile
in that other system is within a threshold distance of the
cluster profile of cluster C. The threshold distance used in
each case is optionally dependent on the cluster variance or
cluster diameter of the profile sets whose means are being
compared.
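
For the case in which cluster C is a cluster of messages, the test above can be sketched as follows. This is an illustration under stated assumptions: profiles are sparse dictionaries of attribute weights, distance is Euclidean, and the closest qualifying community is returned, whereas the text permits any community within the threshold. All function and parameter names are hypothetical.

```python
import math


def mean_profile(profiles):
    """Component-wise mean of sparse profiles (attribute -> weight)."""
    total = {}
    for p in profiles:
        for k, v in p.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(profiles) for k, v in total.items()}


def distance(a, b):
    """Euclidean distance between two sparse profiles."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def find_existing_community(cluster_profile, recent_by_community, threshold):
    """Return the closest community within threshold, or None."""
    best, best_d = None, threshold
    for name, recent_profiles in recent_by_community.items():
        if not recent_profiles:
            continue  # no recent messages to compare against
        d = distance(cluster_profile, mean_profile(recent_profiles))
        if d <= best_d:
            best, best_d = name, d
    return best
```

A return value of `None` corresponds to the situation in which no existing community suits the pre-community and a new one must be considered.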

If no existing virtual community V meets these conditions
and is also willing to accept all the users in pre-community
M as new members, then Virtual Community Service
attempts to create a new virtual community V. Regardless of
whether virtual community V is an existing community or a
newly created community, Virtual Community Service
sends an e-mail message to each pseudonym P in pre-
community M whose associated user U does not already
belong to virtual community V (under pseudonym P) and
has not previously turned down a request to join virtual
community V. The e-mail message informs user U of the
existence of virtual community V, and provides instructions
which user U may follow in order to join virtual community
V if desired; these instructions vary depending on whether
virtual community V is an existing community or a new
community. The message includes a credential, granted to
pseudonym P, which credential must be presented by user U
upon joining the virtual community V, as proof that user U
was actually invited to join. If user U wishes to join virtual
community V under a different pseudonym Q, user U may
first transfer the credential from pseudonym P to pseudonym
Q, as described above. The e-mail message further provides
an indication of the common interests of the community, for
example by including a list of titles of messages recently
sent to the community, or a charter or introductory message
provided by the community (if available), or a label gener-
ated by the methods described above that identifies the
content of the cluster of messages, user profiles, search
profiles, or target objects that was used to identify the
pre-community M.
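
The invitation rule above can be sketched as a simple filter over the pseudonyms in M. In this hypothetical example the credential is a random token standing in for the cryptographic pseudonym credential that the text refers to, and `send` is a caller-supplied delivery hook; neither is specified by the original.

```python
import secrets


def invite_members(pre_community, members_of_v, declined, send):
    """Invite each pseudonym in M that neither belongs to V nor declined.

    send(pseudonym, credential) delivers the invitation message.
    Returns the credentials issued, keyed by pseudonym.
    """
    issued = {}
    for p in sorted(pre_community):
        if p in members_of_v or p in declined:
            continue
        # Placeholder credential: a random token, NOT the cryptographic
        # credential scheme described elsewhere in the text.
        credential = secrets.token_hex(16)
        issued[p] = credential
        send(p, credential)
    return issued
```

The returned mapping lets the community later check that a user presenting a credential was in fact invited under that pseudonym.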
If Virtual Community Service must create a new commu-
nity V, several methods are available for enabling the
members of the new community to communicate with each
other. If the pre-community M is large, for example con-
taining more than 50 users, then Virtual Community Service
typically establishes either a multicast tree, as described
below, or a widely-distributed bulletin board, assigning a
name to the new bulletin board. If the pre-community M has
fewer members, for example 2-50, Virtual Community
Service typically establishes either a multicast tree, as
described below, or an e-mail mailing list. If the new virtual
community V was determined by a cluster of messages, then
Virtual Community Service kicks off the discussion by
distributing these messages to all members of virtual com-
munity V. In addition to bulletin boards and mailing lists,
alternative fora that can be created and in which virtual
communities can gather include real-time typed or spoken
conversations (or engagement or distributed multi-user
applications including video games) over the computer
network and physical meetings, any of which can be sched-
uled by a partly automated process wherein Virtual Com-
munity Service requests meeting time preferences from all
members of the pre-community M and then notifies these
individuals of an appropriate meeting time.

For multi user applications, users may be matched
together who share a high level of interest in that application
or the particular type of content therein as with educational
software, entertainment applications or groupware (e.g.,
intra-organizational) where users may participate remotely
in an application. Any of these multi-user applications may
involve automatic calendaring (by a scheduling agent) for
the purpose of arranging a virtual session between users who
share a common interest in the nature or content of the
application (e.g., a high speed action or suspense adventure
video game) or for some applications (e.g., document edit-
ing groupware) users may sometimes require synchronous
sessions or they may participate asynchronously.

Conversely, users who are currently engaged in a multi user
session may allow the VCS agent to notify or page remote
users who may be interested in participating as in entertain-
ment type applications or whose presence (or contribution)
they feel is needed as with groupware used in an organiza-
tional or professional context (such as with on-line
conferencing, whiteboarding, document editing, virtual cor-
porate meetings, etc.). Matching together users in these
applications assumes that within the current session, pro-
spective participants share the same (or similar) application
thus are profiled accordingly to the nature of the application,
the list of current participants and if relevant secondarily to
the content of the interacting user's dialogs (such as text or
voice chat).

Specifically users are likely to have a common interest in
the nature of an application which can be jointly (passively)
interacted with or jointly viewed such as the content of the
document being edited, the profile of a video being viewed
or a site being visited by a group of users collaboratively
navigating the WWW (or intra-organizational Web). A use-
ful approach to advertising in a virtual chat room, confer-
ence or multi-user application is using the current temporal
profile of the collaborative interaction as a target profile for
which to target ads in real time and dynamically change the
ad presentation as the topical relevance of the interaction
changes, which is then viewed by all of the collaborative
participants simultaneously. In a variation using similar
techniques to those used in the above e-mail filter section,
one appropriateness function which the system could write
could be recommending to a user (such as an employer)
whether or not a virtual meeting (or transcript thereof)
should be made accessible to each employee in the organi-
zation based upon access privileges to particular types of
content granted in the past and other aspects of his/her
profile. This technique may be applied more generally as
well to augment access control to information by employees
in the organization in general.

In accordance with currently used methods, voice and fax
numbers may change dynamically in accordance with the
user's physical location. Specifically users should first be
matched according to their common interest in a type of
application which can be jointly interacted with or jointly
viewed passively (via PC or TV). Then, secondly, users
within such a common interest group may be further sub-
divided into sub-communities according to more specific
common interests which they share (such as sub-
communities) of real time correspondents simultaneously
watching a popular program on television or according to
content profile of the real time dialogues which the users are
engaged in e.g., as they jointly navigate the World Wide
Web, view a video program or television debate or engage
in a video game. Conversely where the forum is smaller
and/or the objectives are more objectively identified, sub-
interest groups may be irrelevant, for example, on-line
seminars, organizational meetings or board meetings in
which relevant users whose presence or participation is
requested may be automatically scheduled (by a scheduling
agent) in advance or the user may be notified or paged if
topical relevancy to the user's interest (or professional
interest) profile is identified in real time by the VCS agent
initially (or throughout the course of the meeting).
Continued Enrollment

Even after creation of a new virtual community, Virtual
Community Service continues to scan other virtual commu-
nities for new messages whose target profiles are similar to
the community's cluster profile (average message profile).
Copies of any such messages are sent to the new virtual
community, and the pseudonymous authors of these
messages, as well as users who show high interest in reading
such messages, are informed by Virtual Community Service
(as for pre-community members, above) that they may want
to join the community. Each such user can then decide
whether or not to join the community. In the case of Internet
Relay Chat (IRC), if the target profile of messages in a real
time dialog are (or become) similar to that of a user, VCS
may also send an urgent e-mail message to such user
whereby the user may be automatically notified as soon as
the dialog appears, if desired.

With these facilities, Virtual Community Service provides
automatic creation of new virtual communities in any local
or wide-area network, as well as maintenance of all virtual
communities on the network, including those not created by
Virtual Community Service. The core technology underly-
ing Virtual Community Service is creating a search and
clustering mechanism that can find articles that are "similar"
in that the users share interests. This is precisely what was
described above. One must be sure that Virtual Community
Service does not bombard users with notices about commu-
nities in which they have no real interest. On a very small
network a human could be "in the loop", scanning proposed
virtual communities and perhaps even giving them names.
But on larger networks Virtual Community Service has to
run in fully automatic mode, since it is likely to find a large
number of virtual communities.
Delivering Messages to a Virtual Community

Once a virtual community has been identified, it is
straightforward for Virtual Community Service to establish
a mailing list so that any member of the virtual community
may distribute e-mail to all other members. Another method
of distribution is to use a conventional network bulletin
board or newsgroup to distribute the messages to all servers
in the network, where they can be accessed by any member
of the virtual community. However, these simple methods do
not take into account cost and performance advantages
which accrue from optimizing the construction of a multi-
cast tree to carry messages to the virtual community. Unlike
a newsgroup, a multicast tree distributes messages to only a
selected set of servers, and unlike an e-mail mailing list, it
does so efficiently.

A separate multicast tree MT(V) is maintained for each
virtual community V, by use of the following four proce-
dures. 1. To construct or reconstruct this multicast tree, the
core servers for virtual community V are taken to be those
proxy servers that serve at least one pseudonymous member
of virtual community V. Then the multicast tree MT(V) is
established via steps 4-6 in the section "Multicast Tree
Construction Procedure" above. 2. When a new user joins
virtual community V, which is an existing virtual
community, the user sends a message to the user's proxy
server S. If user's proxy server S is not already a core server
for V, then it is designated as a core server and is added to
the multicast tree MT(V), as follows. If more than k servers
have been added since the last time the multicast tree MT(V)
was rebuilt, where k is a function of the number of core
servers already in the tree, then the entire tree is simply
rebuilt via steps 4-6 in the section "Multicast Tree Con-
struction Procedure" above. Otherwise, server S retrieves its
locally stored list of nearby core servers for V, and chooses
a server S1. Server S sends a control message to S1,
indicating that it would like to be added to the multicast tree
MT(V). Upon receipt of this message, server S1 retrieves its
locally stored subtree G1 of MT(V), and forms a new graph
G from G1 by removing all degree-1 vertices other than S1
itself. Server S1 transmits graph G to server S, which stores
it as its locally stored subtree of MT(V). Finally, server S
sends a message to itself and to all servers that are vertices
of graph G, instructing these servers to modify their locally
stored subtrees of MT(V) by adding S as a vertex and adding
an edge between S1 and S. 3. When a user at a client q
wishes to send a message F to virtual community V, client
q embeds message F in a request R instructing the recipient
to store message F locally, for a limited time, for access by
members of virtual community V. Request R includes a
credential proving that the user is a member of virtual
community V or is otherwise entitled to post messages to
virtual community V (for example is not "black marked" by
that or other virtual community members). Client q then
broadcasts request R to all core servers in the multicast tree
MT(V), by means of a global request message transmitted to
the user's proxy server as described above. The core servers
satisfy request R, provided that they can verify the included
credential. 4. In order to retrieve a particular message sent to
virtual community V, a user U at client q initiates the steps
described in section "Retrieving Files from a Multicast
Tree," above. If user U does not want to retrieve a particular
message, but rather wants to retrieve all new messages sent
to virtual community V, then user U pseudonymously
instructs its proxy server (which is a core server for V) to
send it all messages that were multicast to MT(V) after a
certain date. In either case, user U must provide a credential
proving user U to be a member of virtual community V, or
otherwise entitled to access messages on virtual community
V.

APPENDED COLLABORATIVE COMPUTING
APPLICATIONS

1. Automatic Retrieval and Assembly of Work Groups

A company often requires a team of skilled personnel
(whose qualifications are specifically suited to the task at
hand). For large corporations it is difficult to keep track of
skill sets of its own internal employees. Conversely for small
[illegible] such accordingly.
The purpose of the system is to emulate expert human
organizers of work teams and make recommendations as to
the most appropriately qualified team of available people for
the given task at hand based upon the stated objectives and
required tasks of a prospective project. Using the presently
described technique using relevance feedback it is possible
to match the profile of the project with that of the available
pool of individuals. The organizer may wish to keep in mind
a variety of considerations in selecting teams for example
considering a variety of qualifications, psychographics and
attributes pertaining to the user's profile as developed from
his/her professional on-line activities and interactions. In
view of the fact that some skill requirements exist in
overlapping disciplines, that the more diversity
(complementarity) of skills of its members, the greater the
likelihood of covering the (important) skill requirements
adequately (suggesting that the greater the complementarity
of attributes characterizing the user's base of qualifications
and information content interaction the more synergistic the
work process). Another consideration may be to find the
fewest number of individuals as possible who collectively
cover the apparent skill requirements. Still another consid-
eration is to favor the reorganizing of groups which had
previously proved themselves by arriving at a successful
solution or product to a similar problem or task. Throughout
the work process certain sub-problems may require tempo-
rary consultation with appropriately qualified individuals
who are more qualified than members of the present team.

Each member of a virtual work group (whether intra-
organizational or inter-organizational) may be prescribed
attributes by a superior such as credentials, observed skill
sets from past experiences and psychographics. This method
may also be used to observe patterns relating to what types
of users are granted access to what type of informational
content e.g., some members of a team may be made privy to
some information which others are not in accordance with
the methods suggested above. The system may present
recommendations which restrict what types of data can be
accessed by that user. Some restriction attributes may be
explicit indicating documents containing which word
attributes a user is forbidden to access. For others the
restriction may be based upon explicit criteria for example
including documents containing words which tend to
co-occur (are metrically close) to those explicitly men-
tioned. Or relative attribute weighting values may be used as
thresholds for determining automatically user document
access and privileges. Another (appropriateness function)
based criteria which may be used as well is the similarity
measure between the document and user profiles. In this case
it may be useful to automatically generate explicit rules
which may present the user profile (with relative attribute
weights) as well as that of the document to the authorized
decision maker. Additionally, as suggested in the e-mail
filter section (above) a fully trained system may additionally
automatically present the rules (appropriateness functions)
which it has written through passive training. Thus the user
may again (in this case for automatically determining docu-
ment access privileges) approve the rules presented or
modify them according*. *^ * SS^eSof methods may be used for retrieving

cases be retrievable in segments (which lack forbidden ^ b y organizations to determine the relevance of 

terms) if the authorized credential granting party so allows, doamen y S ^ fax> le , h d 

in me case where documents or document segments and the physical dialogues) to the interests of the user as 

co^sponding individuals are ascribed manual restriction 5 £™ ^ M ^ t correspondences 

Stes! user restriction attributes may act as a restriction S, ^ tered out .Want ones may accordingly be clustered 

toSaS to or deleting from relevant documents (or jjjj and organized into a 

Lments) or prohibiting access altogether as suggested. \ iDduslrial browsing M above described. Fo "f 

^restrictions may be integrated with the document file , ^ „ "hs.en in" on cerwn types of 

Sha authorization credentials may act as a decryption to J^^ccs with a particular client b -a pamcula 

f «>ai duujui transmitted or conveyed else- emolovee (via phone number and voice ID using Neural rrei 

ga^i . in 

present technique can be usefully «° ^ ^ fdvis^dSlariy ta Torfer to enable the system to develop 

applications of the virtual dialogues (live or mdexed gi veD employee for both outgoing 

horded) Le., matching users with virtual meetogs « rid. "jg-gjjj^ c g ach ^ communication media 

above e-mail and telephony router in both ca^wterein and in ^ of rou ,ing. 

users are granted attribute based pnv.leges to access J.or winch u > ^/designate particular clusters to be 

denial for accessing) certain dialogues ,n accordance with 20 A WP^^ relevan , or irre levanl to a given 

their content. . - s employment duties. For any employee a summary 

In one exemplary approach, a virtual work group is user s emp y ^ automatically e . mailed 

assembledforengineeringaproduc,,manotherauUto^ JP^Uy and/or notification made if, for example, a 

editing a document, in another arrive at «rrx,rate pohcy for pe omc y cxceeds 

a particular need or unresolved issue or for me purposes of 25 ce am ^ ^ ^ . 

citing virtual breakout sessions witrun an on- me con er- c ^™ delected or man uaUy entered key word detecnon 

ence (multi organizational) or corporate meeting. Many P^™^ certain instances . In this regard, the attnbute 

other examples are possible. of time may be usefuil in determinmg whether or not 

2. Virtual Meetings mt „ arn , K ,n irrelevant dialogues are occurring during scheduled worK 

Particularly withm large organ^ions.it^ adv^ ag^us 30 uretevant o gu o£ ^ are 

,o disseminate company (inside) news and mformatioc to nme* ^ ^ ^ chlster of inler est may ako 

those employees for whom the information a ; valuable adduwna, ^ ^ Qf ^ 

Using tesarne basic profiling techniques (above). Vernal be ™ an a licalion variatio n the present tecb- 

dialogues (either physical meetings or entirely vulual ^° r ^ be ^ d for purpoS es of monitormg acuviues and 

mee tings, ^« ° r tde P h ° n ?, ^l^ivc index- generaToehavior of chubby parents or on-line scholastic 

matically profiled on the fly and used for « s P°^ c ^ d « ffiLfoa) behavior by teachers. Phone companies may 

ing and notification of those users to whom the » uformatwo f^Lly / his tcchniqU6 to better monitor communications 

is valuable (and to whom it is privy). As the content of such
a dialogue may change with time, new users may be
prompted to join while others may be prompted or alterna-
tively (for confidentiality reasons) may be mandated to
[...] cluster menu trees from previous patent). [...]
useful for intuitive browsing of large archives of this
information). Digital credentials may be [...]
employee by superiors which indicate for him/her the spe-
cific information contexts (by clusters) which are
mandatory, which are recommended, which are neutral and
[...]
(for the mandatory credential) require also mandatory (real
time) attendance. A scheduling agent may be used to orga-
nize meeting times in advance by [...]
the most relevant users as to the stated objectives of the
meeting. This is done by coordinating available time slots to
optimize the availability of the most number of high
relevance users to the dialogue (the user may also indicate
among his/her available time the level of convenience as
well). As above suggested, in virtual work groups a virtual
meeting's objective may be to solve a particular problem
and develop a strategy, plan or [...] stated objective
of which may be used to index a virtual group whose
complement and skills provides an optimal solution thereto.

[...] activities. In each of the above
[...] alternative feature is the ability
[...] restrictions on particular
[...]
number of schools) may be accessible for participation
[...] lectures, continuing education
conferences, tutorials for job training (or on-going
[...] requirements) may apply. The most exemplary
[...] virtual classroom. Students may
[...] topic or problems or a query. The system will
[...] appropriate on-line lecture either live
[...] (e.g., recommending the next
[...]) or the most appropriate pre-recorded lec-
ture [...]
[...] recorded single (closed) session or
[...] session may be presented similarly) or he
may receive a recommendation of the name of the
[...] faculty or student recommended
[...] the student may either
[...] the lecturer (throughout the
[...] ones may
[...] be selected by
Additionally, if/when desired, sub-dialogues may occur
between attendees in the absence of the others. This is also 
one application of joint user navigation as the presenter 
(lecturer or student presenting a question) presents 
questions, content or solutions or navigates through infor- 
mational spaces in joint collaborative fashion for all attend- 
ees (or those designated by the presenter). 

In one variation, students who are most in need of help in a
definable domain (attribute/cluster), as indicated by their
request or by a lack of proficiency evidenced in quiz or test
scores, may be matched with other students or tutors who are
proficient in those areas. In one approach, students may be
matched for purposes of collaborative study sessions in
which priority is given to those who possess the greater
degree of complementarity within their respective domains
of proficiency/deficiency. The present clustering model may
further improve the accuracy of predicting the content
domains in which a student is expected to be proficient. For
example, in the pure clustering model, it is possible to make
associations between which domains a student is LIKELY to
be proficient in according to areas of previous proficiency
(within the same class or different ones, based upon historical
data from previous students). It can be appreciated that the
present system may readily be applied also to corporate or
professional applications including organizational training
sessions, continuing education or conference seminars.
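
The complementarity-based matching described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the per-domain proficiency scores in [0, 1], the 0.5 proficiency threshold, the function names and the greedy pairing strategy are all assumptions made for illustration.

```python
from itertools import combinations


def complementarity(a, b, threshold=0.5):
    """Count domains in which exactly one of two students is proficient.

    `a` and `b` map a domain name to a proficiency score in [0, 1]
    (a hypothetical representation, e.g. normalized quiz scores).
    """
    score = 0
    for domain in set(a) | set(b):
        a_ok = a.get(domain, 0.0) >= threshold
        b_ok = b.get(domain, 0.0) >= threshold
        if a_ok != b_ok:  # one student can help the other in this domain
            score += 1
    return score


def pair_students(profiles, threshold=0.5):
    """Greedily pair students, most complementary pairs first."""
    candidates = sorted(
        combinations(sorted(profiles), 2),
        key=lambda p: complementarity(profiles[p[0]], profiles[p[1]], threshold),
        reverse=True,
    )
    paired, result = set(), []
    for s, t in candidates:
        if s not in paired and t not in paired:
            result.append((s, t))
            paired.update((s, t))
    return result
```

A greedy pass is used only for brevity; it gives priority to the most complementary pairs but does not guarantee the best overall assignment.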

5. Virtual Communities Developed Around Product Genres, 
Categories, or Items. 

The most "interested" users for a particular topic or target 
object (e.g., or limited to selected exemplary target objects) 
may be automatically matched for a virtual dialogue which 
is accessible directly from the target object of interest while 
browsing. This virtual dialogue includes standard BBS, IRC,
Internet telephone and video telephony. Applications include
store front products (and categories), musical albums,
movies, stocks (or mutual funds). In one approach the
criterion for creating a virtual group or matching people one
on one is to find among "similar" users the greatest degree
of complementarity (difference) in their respective experi-
ences, thus optimizing the conditions for the users to share
invaluable knowledge with one another about, for example,
a business venture or a regional or national economy.
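
As a rough illustration of this criterion, the Python sketch below first restricts attention to users whose interest profiles are similar and then picks the one whose experiences differ most. The cosine similarity measure, the 0.8 threshold, the representation of experiences as sets of labels and all names are assumptions for the example, not details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {attribute: weight} vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_partner(user, interests, experiences, min_similarity=0.8):
    """Among users with similar interests, return the one whose
    experiences differ most from `user`'s, or None if nobody qualifies.

    interests:   user -> {attribute: weight}   (hypothetical format)
    experiences: user -> set of experience labels
    """
    best, best_diff = None, -1
    for other in sorted(interests):
        if other == user:
            continue
        if cosine(interests[user], interests[other]) < min_similarity:
            continue  # not "similar" enough to be matched at all
        diff = len(experiences[user] ^ experiences[other])  # symmetric difference
        if diff > best_diff:
            best, best_diff = other, diff
    return best
```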

6. (Ancillary Inclusion) Hybrid TV/PC 

In TV units which have integrated dual mode capabilities 
for TV and PC functionalities simultaneously (e.g., viewing 
TV programming while sending/receiving e-mail) the VCS 
agent may be used not only to point the users to the most 
appropriate TV programming for their interest at any given 
time (selectively refer to and/or transcribe Home Video Club 
patent) but it may also bring the participating viewers of a
program to the attention of each other, thus allowing viewers
to exchange comments or share perspectives about the
programming before, during or after the program. Within
these user circles the VCS may further narrow the criteria of
interacting users by their specific viewing profiles.

7. Physical Meetings 

In one exemplary approach VCS organized communities 
may meet in physical forums (e.g., where all the members 
are required to live in a physically close proximity as a 
prerequisite for matching) for example organizing meetings 
or according to general criteria (e.g., socializing and gath- 
ering in a restaurant/night club, concert or movie theater) or 
alternatively wherein a human or machine designated theme 
is the basis for the community for example a meeting around 
a political or a community related issue, an item of common 
interest within a large organization, a vacation destination 
(which all of the members are likely to wish to visit in the
future and wherein a date could be scheduled using a 
scheduling agent for a group tour). Such a community could,
for example, be developed around such a travel destination
as part of a travel agent's Web site as a marketing pitch for
soliciting a trip.
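
A scheduling agent of the kind referred to above could, as one possibility, score each candidate date or time slot by the relevance of the users who are available and by the convenience they indicate. The following Python sketch assumes a simple weighted sum; the data layout, the weighting and the function name are hypothetical.

```python
def best_slot(availability, relevance):
    """Pick the slot with the highest relevance-weighted convenience.

    availability: user -> {slot: convenience in (0, 1]}  (hypothetical format)
    relevance:    user -> weight expressing the user's relevance to the meeting
    """
    totals = {}
    for user, slots in availability.items():
        weight = relevance.get(user, 0.0)
        for slot, convenience in slots.items():
            totals[slot] = totals.get(slot, 0.0) + weight * convenience
    if not totals:
        return None
    # Iterate slots in sorted order so ties are broken deterministically.
    return max(sorted(totals), key=lambda slot: totals[slot])
```

With this scoring, a slot that suits a few highly relevant users can outrank one that suits many marginal users, which is the trade-off the text describes.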

SUMMARY 

A method has been presented for automatically selecting
articles of interest to a user. The method generates sets of
search profiles for the users based on such attributes as the
relative frequency of occurrence of words in the articles read
by the users, and uses these search profiles to efficiently
identify future articles of interest. The method is charac-
terized by passive monitoring (users do not need to explic-
itly rate the articles), multiple search profiles per user
(reflecting interest in multiple topics) and use of elements of
the search profiles which are automatically determined from
the data (notably, the TF/IDF measure based on word
frequencies and descriptions of purchasable items). A
method has also been presented for automatically generating
menus to allow users to locate and retrieve articles on topics
of interest. This method clusters articles based on their
similarity, as measured by the relative frequency of word
occurrences. Clusters are labeled either with article titles or
with key words extracted from the article. The method can
be applied to large sets of articles distributed over many
machines.
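
The word-frequency profiles summarized above can be sketched in Python as follows. This is only an illustration under stated assumptions: the weighting used here (term frequency times the logarithm of inverse document frequency) is one common TF/IDF variant, a single search profile is formed by averaging the profiles of articles the user has read, and all function names are hypothetical.

```python
import math


def tfidf_profiles(articles):
    """Return one {word: weight} profile per article.

    `articles` is a list of token lists. Each weight is the word's
    frequency within the article times log(N / articles containing it).
    """
    n = len(articles)
    doc_freq = {}
    for tokens in articles:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    profiles = []
    for tokens in articles:
        counts = {}
        for word in tokens:
            counts[word] = counts.get(word, 0) + 1
        profiles.append({
            w: (c / len(tokens)) * math.log(n / doc_freq[w])
            for w, c in counts.items()
        })
    return profiles


def cosine(u, v):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_profile(read_profiles):
    """Average the profiles of articles the user read (no explicit ratings)."""
    total = {}
    for profile in read_profiles:
        for w, x in profile.items():
            total[w] = total.get(w, 0.0) + x
    return {w: x / len(read_profiles) for w, x in total.items()}


def rank_articles(profile, candidates):
    """Return candidate indices ordered from most to least similar."""
    return sorted(range(len(candidates)),
                  key=lambda i: cosine(profile, candidates[i]),
                  reverse=True)
```

A user with several distinct interests would keep several such search profiles rather than one average; the same similarity measure could also drive the clustering used to build menus.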

It has been further shown how to extend the above
methods from articles to any class of target objects for which
profiles can be generated, including news articles, reference
or work articles, electronic mail, product or service
descriptions, people (based on the articles they read, demo-
graphic data, or the products they buy), and electronic
bulletin boards (based on the articles posted to them). A
particular consequence of being able to group people by
their interests is that one can form virtual communities of
people of common interest, who can then correspond with
one another via electronic mail.
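
Grouping people by their interests can be sketched as below. The summary does not prescribe a particular grouping algorithm at this point, so the example substitutes a simple single-pass "leader" clustering over cosine similarity; the 0.6 threshold, the profile format and the names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {attribute: weight} vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def form_communities(user_profiles, min_similarity=0.6):
    """Group users into communities of common interest.

    Each user joins the first community whose founding member is
    similar enough; otherwise the user founds a new community.
    Only communities with at least two members are returned.
    """
    communities = []  # list of (founder, members)
    for user in sorted(user_profiles):
        for founder, members in communities:
            if cosine(user_profiles[user], user_profiles[founder]) >= min_similarity:
                members.append(user)
                break
        else:
            communities.append((user, [user]))
    return [members for _, members in communities if len(members) > 1]
```

Each returned list could then serve as the membership of a mailing list through which the matched users correspond.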
I claim: 

1. A method for providing a user with access to selected
ones of a plurality of target object bulletin boards that are
accessible via an electronic data transmission media, where
said users are connected via user terminals and data com-
munication connections to a server system which provides
access to said electronic data transmission media, said
method comprising the steps of:
  automatically generating target profiles for target object
    bulletin boards that are accessible by said electronic
    data transmission media, each of said target profiles
    being generated from the contents of an associated one
    of said target object bulletin boards;
  automatically generating at least one user target profile
    interest summary for a user at a user terminal, each said
    user target profile interest summary being generated
    from ones of said target object bulletin boards accessed
    by said user; and
  enabling access to said plurality of target object bulletin
    boards accessible by said electronic data transmission
    media by users via said target profile, comprising:
    automatically creating virtual communities of users of
      said target object bulletin boards, comprising:
      scanning bulletin board postings to existing target
        object bulletin boards,
      identifying groups of user identifications whose
        associated users have common interests,
      matching users with other like inclined users to
        create a new target object bulletin board.
2. The method for providing a user with access to selected
ones of a plurality of target object bulletin boards of claim
1, wherein said step of automatically creating further com-
prises:
  dynamically creating electronic mailing lists for said users
    matched by said step of matching.
3. The method for providing a user with access to selected
ones of a plurality of target object bulletin boards of claim
2, wherein said step of automatically creating further com-
prises:
  automatically transmitting a notification to said users
    matched by said step of matching to identify said new
    target object bulletin board to said ones of said asso-
    ciated users.
4. The method for providing a user with access to selected
ones of a plurality of target object bulletin boards of claim
1, wherein said step of automatically creating further com-
prises:
  continuing to enroll additional users in said new target
    object bulletin board.
5. A method for providing a user with access to selected
ones of a plurality of target object bulletin boards that are
accessible via an electronic data transmission media, where
said users are connected via user terminals and data com-
munication connections to a server system which provides
access to said electronic data transmission media, said
method comprising the steps of:
  automatically generating target profiles for target object
    bulletin boards that are accessible by said electronic
    data transmission media, each of said target profiles
    being generated from the contents of an associated one
    of said target object bulletin boards comprising:
    generating a target profile comprising the cluster profile
      for a cluster of documents posted on said new target
      object bulletin board,
    automatically generating at least one user target profile
      interest summary for a user at a user terminal, each
      said user target profile interest summary being gen-
      erated from ones of said target object bulletin boards
      accessed by said user; and
    enabling access to said plurality of target object bulletin
      boards accessible by said electronic data transmis-
      sion media by users via said target profile.
6. A method of operating a network-based agent to seek
out users of a network with common interests, where said
users are connected via user terminals and data communi-
cation connections to a server system which provides access
to an electronic data transmission media, comprising the
steps of:
  dynamically creating bulletin boards for said users, com-
    prising:
    scanning bulletin board postings to existing bulletin
      boards,
    identifying a group of users who have common
      interests,
    matching users with other like inclined users in said
      identified group to create a proposed new bulletin
      board.
7. The method of operating a network-based agent of
claim 6 wherein said step of scanning bulletin boards
comprises:
  automatically generating target profiles for bulletin
    boards that are accessible by said electronic data trans-
    mission media, each of said target profiles being gen-
    erated from the contents of an associated one of said
    bulletin boards.
8. The method of operating a network-based agent of
claim 7, wherein said step of automatically generating target
profiles comprises:
  generating a target profile comprising the cluster profile
    for a cluster of documents posted on said bulletin
    boards.
9. The method of operating a network-based agent of
claim 6 wherein said step of identifying a group of users
comprises:
  automatically generating at least one user target profile
    summary for a user at a user terminal, each said
    user target profile interest summary being generated
    from ones of said bulletin boards [...]
10. The method of operating a network-based agent of
claim 6, wherein said step of automatically creating further
comprises:
  dynamically creating electronic mailing lists for said users
    matched by said step of matching.
11. The method of operating a network-based agent of
claim 10, wherein said step of automatically creating further
comprises:
  automatically transmitting a notification to said users
    matched by said step of matching to identify said
    proposed new bulletin board to said ones of said
    associated users.
12. The method of operating a network-based agent of
claim [...], wherein said step of automatically creating further
comprises:
  continuing to enroll additional users in said proposed new
    bulletin board.
13. The method of operating a network-based agent of
claim 6, wherein said step of matching comprises:
  identifying an existing bulletin board whose mean profile
    of the [...] of messages recently posted therein is within
    a threshold distance of the cluster profile of said
    proposed new bulletin board.
14. The method of operating a network-based agent of
claim 13, further comprising the step of:
  automatically transmitting a notification to said users
    matched by said step of matching to identify said
    existing bulletin board to said ones of said associated
    users.
15. The method of operating a network-based agent of
claim 14, wherein said step of automatically transmitting a
notification comprises:
  transmitting to said users matched by said step of match-
    ing an indication at least one of the data comprising an
    indication of common interest including: a list of titles
    of messages recently sent to the bulletin board, an
    introductory message provided by the bulletin board, a
    label that identifies the content of the cluster profile that
    was used to identify the existing bulletin board.

* * * * *